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Abstract 


A  numerical  analysis  of  the  aerodynamic  phmomena  associated  with  the  hi^-speed 
flight  of  a  sharp-nosed,  four-finned,  hi^-fineness-ratio  missile  using  a  block-structured,  par¬ 
allel  computer  algorithm  is  presented.  'Rie  Parallel  Navier-Stokes  Three-Dimensional  (3D) 
Explicit  Missile  (PANS-3EM)  algorithm  utilizes  a  second-order-accurate,  shock-capturing. 
Total  Variation  Diminishing  (TVD)  scheme  and  incorporates  a  Baldwin-Lomax  turbulence 
model.  PANS-3EM  allows  for  extreme  flexibility  in  the  choice  of  computational  domain  decom¬ 
position  and  provides  for  a  central  “stepping-off  point”  for  subsequent  modification  to  run  on 
either  a  shared-  or  distributed-memory  computing  machine.  Developmmtal  work  consists  of 
conceptualization  and  verification  of  the  algorithm  as  well  as  parallel  performance  and  seal- 
ability  studies  conducted  on  a  variety  of  computing  platforms. 

Using  PANS-3EM,  the  aerodynamic  diaracteristics  of  the  above-mentioned  missile 
are  investigated.  Specifically,  drag  and  pitching  moment  coefficient  data  are  computed  and 
compared  against  experimental  fli^t  data  obtained  from  the  United  States  Air  Force 
Aeroballistic  Research  FaciUty  at  Eglin  Air  Force  Base.  Trends  in  the  numerical  data  agree 
with  experimental  results  reported  by  Eglin  with  the  exception  that  an  imexpected  reversal  of 
the  stability  characteristics  exhibited  by  the  flight-test  missile  at  speeds  in  excess  of  Mach 
3.75  are  not  confirmed  by  the  computer  code. 


A  NUMERICAL  STUDY  OF  HIGH-SPEED  MISSILE 
CONFIGURATIONS  USING  A  BLOCK-STRUCTURED 
PARALLEL  ALGORITHM 


7.  Introduction 


1.1  Problem  Motivation 

Modern  numerical  algorithms  capable  of  resolving  complicated  flow  structures  such 
as  shocks  and  boundary  layers  place  great  demands  on  computing  resources.  Although  full 
aircraft  configuration  Navier-Stokes  calculations  have  been  performed  by  Shang  [32]  and  oth¬ 
ers,  current  generation  supercomputers  possess  neither  the  speed  nor  the  storage  capacity  to 
peuorm  \  complete  aircraft  model  simulation  including  full-scale  turbulence,  chemical  reac¬ 
tion,  structiural/aerodynamic  coupling,  and  radar  reflectivity  models.  For  example,  depending 
on  the  numerical  method  used,  a  physical  discretization  of  between  ten  and  twenty  [6,30] 
node  points  are  required  to  adequately  resolve  the  smallest  length  scales  associated  with  a 
problem.  In  the  case  of  radar  cross-section  computations,  a  scale  simulation  of  a  1  gigahertz 
signal  illuminating  a  standard  size  U.S.  fighter  aircraft  would  require  a  grid  containing 
approximately  10,000,000  node  points  [33].  Storage  of  flow  field  and  electromagnetic  variables 
as  well  as  grid  metrics  for  each  node  would  easily  exceed  both  the  storage  and  computational 
capabilities  of  all  current  and  next-generation  computing  machines.  Furthermore,  given  the 
history  of  computing  performance  improvements  inevitably  leading  to  more  ambitious 
projects  of  research,  it  can  be  safely  assumed  that  a  computing  deficit  will  continue  for  the 
foreseeable  future.  Figure  1  depicts  a  few  of  the  computationally  intensive  fluids-related  prob¬ 
lems  that  are  of  current  interest. 

While  computing  demands  continue  to  increase,  there  are  physical  barriers  to  the  con¬ 
tinued  speed  improvements  of  traditional  serial  computing  machines.  Signals  cannot  propa- 
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speed  (FLOPS) 

Data  compiled  from  References  [24]  and  [401. 


Figure  1 :  Current  and  Projected  Computational  Requirementa 


gate  faster  than  the  speed  of  light,  and  thus  components  separated  by  some  physical  distance 
must  necessarily  have  a  finite  response  time.  Reducing  the  physical  size  of  the  components 
can  help,  but  components  can  only  be  made  so  small  before  barriers  are  reached  in  terms  of 
minimum  logic  device  sizes  and  heat  dissipation  requirements. 

One  extremely  promising  technology  which  could  help  to  overcome  these  shortfalls  is 
parallel  computing.  While  there  are  many  parallel  computing  paradigms  (Single-Instruction, 
Single-Data  (SISD);  Single-Instruction,  Multiple-Data  (SIMD);  Multiple-Instruction,  Multi¬ 
ple-Data  (MIMD);  etc.^),  all  use  multiple  processing  units  to  perform  nmultaneous  calcula- 

1.  The  reader  is  referred  to  Reference  [24]  for  a  discussion  of  the  various  parallel  conqNitiag  para¬ 
digms. 
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Data  compiled  from  References  [9],  [20],  [22],  and  [40]. 

Rguro  2:  Supercomputer  Performance 


tions.  These  calculations  can  be  performed  on  separate  parts  of  a  single  problem,  or  they  can 
be  performed  on  completely  unrelated  problems.  In  either  case,  the  effective  throughput  of  the 
computer  is  increased  since  a  larger  amount  of  computational  work  can  be  performed  in  a 
given  amount  of  time. 

In  addition  to  the  many  parallel  computing  paradigms,  parallel  computing  can  be 
implemented  at  many  different  performance  levels.  Althougl^  there  are  many  different  bendi- 
marks  for  computing  performance  [39],  a  common  metric  is  the  number  of  floating  point  oper¬ 
ations  conducted  in  one  second^  (FLOPS).  In  terms  of  sheer  computing  power,  super¬ 
computers  provide  the  highest  level  of  performance.  Until  recently,  the  superoomputing  realm 
has  been  occupied  exclusively  by  madiines  with  few  but  very  powerful  processing  units.  This 
trend  is  changing  as  is  evident  in  Figure  2  which  shows  the  capabilities  of  a  few  of  the  current 
and  projected  supercomputers.  The  number  in  parentheses  next  to  each  computer  name 
denotes  the  maximum  number  of  processors  available  on  the  madiine.  Althou^  single-pro¬ 
cessor  supercomputers  are  still  in  use,  they  are  being  replaced  by  multiple  processor  models 
which  are  capable  of  delivering  much  hi^er  performance  and  throu^put. 


1.  This  number  is  typically  measured  in  terms  of  million  floating  point  tolerations  per  second 
(MFLOPS)  or  billion  floating  point  operations  per  second  (GFLOPS). 


While  si^Mrcomputers  inarguably  occupy  the  pinnacle  of  computing  power,  they  are 
extremely  expensive  and  therefore  only  available  to  a  relatively  small  number  of  users.  Con¬ 
sequently;  another  parallel  computing  approach  is  evolving  in  the  workstation-farm  concept 
wherein  a  number  of  physically  distinct  and  possibly  heterogeneous  computing  platforms  are 
used  in  combination  to  attadi  a  computing  problem.  Software  tools  such  as  Parallel  Virtual 
Machine  (PVM)  [25],  are  designed  specifically  to  OEploit  this  type  of  parallelism.  This 
approadi  has  advantages  of  flexibility,  scalability,  availability,  and  affordability  over  the 
supercomputing  approach.  Furthermore,  recent  advances  in  workstation-level  chip  tedmol- 
ogy^  have  brought  a  large  amount  of  economical  computing  power  to  the  workstation  level. 

Despite  the  continued  performance  increase  of  computing  machines,  hardware  alone 
is  not  enough  to  solve  the  computational  deficit  Indeed,  a  powerful  machine  using  weak  algo¬ 
rithms  and  poorly  designed  computer  codes  can  never  reach  its  full  potential.  Smith  [36] 
underscores  the  importance  of  software  design  in  his  hyperbole,  *10  software  everything?  No; 
it’s  the  only  thing.”  In  order  to  fully  harness  the  power  of  parallel  processing,  it  is  necessary  to 
develop  new  algorithms  which  lend  themselves  to  parallelism;  it  is  not  enough  to  merely  use 
sequential  algorithms  on  parallel  machines. 

While  the  potential  benefits  of  parallel  algorithms  are  widely  admowledged,  there  are 
difficulties  uniquely  associated  with  parallel  algorithm  development.  Unlike  a  serial  machine 
which  typically  contains  a  single  processor  and  a  single  bank  of  main  memory,  a  parallel 
machine  can  be  configured  in  a  myriad  of  ways:  from  a  large  number  of  relatively  low-powered 
processors  (the  Thinking  Machines  CM-2,  for  example)  to  a  smaller  number  of  very  powerful 
processors  (the  Cray  Y-MP,  for  example).  Furthermore,  the  memory  architecture  can  be  dis¬ 
tributed  such  that  each  processor  contains  a  bank  of  local  memory,  or  it  can  be  shared 
whereby  all  processors  access  a  single  global  memory  bank.  In  certain  instances  such  as  the 
Thinking  Machines  CM-5,  a  single  machine  can  utilize  both  shared  and  distributed  memory 
[7].  All  of  these  ardiitectures  pose  interesting  and  challenging  problems  for  the  would-be  par¬ 
allel  programmer. 


1.  Examples  include  the  DEC  A^dia  AXP,  MIPS  R4400  and  Sun  SuperSparc  processes. 
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1.2  Parallel  Computing  and  Computational  Fluid  Dynamics  (CFD) 

Tlie  applicability  of  parallel  computing  to  CFD  was  so  obvious  from  the  early  incep¬ 
tions  that  the  CFD  iwogramming  language  was  developed  by  the  NASA  Ames  Researdb  Cen¬ 
ter  for  use  on  the  first  SIMD  parallel  computer,  the  Illiac  IV  [18].  More  reomtly,  several 
researchers  have  demonstrated  the  performance  gains  possible  in  CFD  computations  using 
parallel  computing  on  ardiitectures  as  varied  as  shared-memory  MIMD  madiines  [29]  and 
distributed-memory  massively  parallel  MIMD  machines  [14].  However,  given  the  difficulties 
associated  with  parallel  algorithm  developmmit  coupled  with  the  fact  that  very  powerful  vec¬ 
tor  madiines  have  been  very  pervasive  in  the  scientific  communis  for  quite  some  time,  the 
trend  to  embrace  parallel  computing  for  CFD  applications  has  been  slow. 

A  reexamination  of  Figure  2  may  provide  some  insist  into  the  future  direction  of  par¬ 
allel  computing  and  CFD.  In  the  past,  the  existence  of  powerful  vector  machines  like  the 
Crays  tended  to  focus  algorithm  development  work  on  codes  designed  to  be  highly  vectoriz- 
able.  Similarly,  there  is  a  synergistic  relationship  between  the  introduction  new  and  powerful, 
massively-parallel  machines  like  the  Paragon  and  the  development  of  algorithms  which 
exploit  the  inherent  parallelism  in  problems  of  interest.  The  tie  between  CFD  and  parallel 
computing,  therefore,  can  only  become  stronger. 

1.3  Numerical  Algorithm  Overview 

Given  the  potential  benefits  of  parallel  computing  and  the  demonstrated  need  for  flex- 
ibibty  and  speed  in  solving  CFD  problems,  this  study  focuses  on  the  developmoit  of  an  algo¬ 
rithm  to  simiilate  flow  fields  about  high-speed  missile  configurations  in  a  block-structured, 
parallel  fashion.  The  primary  objectives  are  twofold  and  of  equal  importance:  1)  to  develop  an 
algorithm  which  decomposes  the  computational  domain  into  a  set  of  subdomains  or  blocks 
and  then  solves  the  governing  ecpiations  in  parallel  over  the  blocks,  and  2)  to  utilize  the  algo¬ 
rithm  to  investigate  the  aerodynamic  characteristics  of  a  sharp-nosed,  four-finned,  high-speed 
missile. 

The  numerical  algorithm  used  to  solve  the  governing  equations  is  a  second-order- 
accurate  Total  Variation  Diminishing  (TVD)  sdieme  developed  by  Harten  [16]  and  modified  by 
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Yee  [47].  It  is  an  extensi<m  of  the  basic  first-ordw-accurate  Roe  sdieme  [31],  and  adiieves  its 
second-order  accuracy  by  appropriately  modifying  the  numerical  inviadd  flux  terms  and 
extending  the  Roe  sdieme’s  three-point  computational  stmicil  (in  the  one-dimensional  case)  to 
include  an  additional  two  points. 

Moran  [26]  has  implmnented  the  Harten-Yee  TVD  sdieme  in  a  computer  code 
designed  to  study  the  h4^-speed  fli^t  of  missiles.  This  code  provided  a  baseline  whidi  was 
modified  to  incorporate  domain  decompositicm  and  ultimately  resulted  in  PANS-3EM. 

1.4  Domain  Decomposition 

The  Harten-Yee  TVD  scheme  discussed  in  Section  1.3  is  iteratively  ^plied  at  the  node 
points  of  the  computational  grid  surrounding  the  missile  configuration  until  a  flow-field  solu¬ 
tion  of  desired  accuracy  is  obtained.  To  increase  the  efficiency  of  the  algorithm,  it  is  possible  to 
decompose  the  computational  domain  into  subdomains  or  blocks  so  that  calculations  can  be 
conducted  simultaneously  over  the  subdomains.  Thus,  it  is  beneficial  to  explore  approaches  by 
which  the  computati(mal  domain  can  be  decomposed. 

Several  methods  of  doxnain  decomposition  have  been  examined  by  various  research¬ 
ers.  Hammond  and  Barth  [15]  used  a  finite-volume  approadi  which  assigned  eadi  vertex  of 
the  computational  mesh  to  a  computer  processor.  Swisshelm,  Horten,  Alef,  and  others 
[41,19,1,40]  utilize  multigrid  techniques  vdudi  discr^ize  the  computational  domain  into  a 
series  of  embedd&i  grids  which  vary  in  coarseness  and  are  useful  for  accelerating  convergence 
and  damping  certain  wave  numbers  of  the  solution  which  create  unwanted  numerical  oscilla¬ 
tions  in  regions  of  abrupt  flow  changes.  Tkdlin,  Hauser,  and  Furukawa  use  a  blocking  or  zonal 
decomposition  [46,17,11]  that  divides  the  computational  domain  into  a  series  of  logically  rect¬ 
angular  blocks  which  may  or  may  not  overlap.  Parallelism  occurs  by  advancing  the  solution 
simultaneously  over  each  of  the  blocks.  In  all  of  these  schemes,  some  type  of  synchronization 
is  usually  required  among  the  parallel  processes  to  ensure  that  the  numerical  solution 
remains  stable. 

Of  the  domain  decomposition  methods  discussed  here,  the  block  structured  method 
was  best  suited  to  the  framework  of  Moran’s  computer  code  sine**  it  did  not  require  code 
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restructuring  or  grid  modification  for  implementation.  Furthermore,  with  future  expandabil¬ 
ity  in  mind,  the  method  allows  for  extreme  flodbility  in  grid  partitioning  and  solution  compu¬ 
tation.  For  example,  using  a  blocked  approach,  it  is  possible  to  simultaneously  use  the  less- 
computationally-intmisive  Euler  equations  in  blocks  where  viscous-related  phenomena  are 
negligible  while  other  blocks  incorporate  the  full  Navier-Stokes  equations. 

1.5  Missile  Configuration  Study 

Using  the  Harten-Yee  TVD  sdieme  discussed  in  Section  1.3  coupled  with  the  domain- 
decomposition  approach  discussed  in  Section  1.4,  it  is  feasible  to  conduct  an  efficient  analysis 
of  the  aerodynamic  performance  of  high-speed  missile  fli^t.  Recent  experimental  research 
conducted  by  the  United  States  Air  Force  Aeroballistic  Researdi  Facility  at  Eglin  Air  Force 
Base  [13,44]  centers  on  the  determination  of  aerodynamic  performance  of  generic,  fin-stabi¬ 
lized,  high-speed  missile  configurations.  Several  nose/fin  configurations  have  been  investi¬ 
gated  including  sharp  and  blunt  ogive  noses  with  four  clipped  delta  fins.  If  the  missile  fins  are 
considered  as  flat  plates  \  then  linearized  supersonic  flow  theory  (see  Anderson  [3],  Chapter 
11)  predicts  that  the  lift  generated  by  the  fins  is  related  to  Idach  number  by 


Since  the  lift  generated  by  the  missile  fins  decreases  with  increasing  Madi  number, 
the  pitching  moment  generated  by  the  fins  must  also  decrease.  This  trend  is  confirmed  by  the 
Eglin  ttperimental  data  through  Mach  numbers  approadiing  3.75  at  whidi  time  an  unex¬ 
pected  reversal  of  the  pitching  moment  derivative  (Cm„)  curve  occurs  whidi  has  not  yet  been 
satisfactorily  explained.  It  is  the  objective  of  this  study  to  utilize  PANS-3EM  to  study  the 
aerodynamic  performance  about  the  sharp-nosed  missile  configuration  with  spedal  interest 
devoted  to  the  stability  trends  observed  for  increasing  Mach  numbers.  Moran  and  Beran  [27] 
have  studied  the  blunt-nosed  configuration  exhaustively,  and  their  work  provides  an  excellent 

1 .  Tbe  fins  used  on  the  eiqierimental  model  woe  formed  from  thin  sheet  metal,  with  no  camber  ot 
twist 
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Figtir*  3:  IMMito  QMimlriM 

comparative  reference  to  the  work  conducted  in  this  investigation.  Both  sharp-  and  blunt- 
nosed  missile  configurations  are  depicted  in  Figure  3. 

1. 6  Document  Roadmap 

With  the  discussion  of  parallel  computing  and  its  applicabili^  to  CFD  as  a  basis, 
Chapter  2  outlines  the  system  (tf  conservati>.  n  laws  and  associated  boundary  conditions  that 
govern  the  flow  behavior  in  the  high-speed  flij^t  regime  and  discusses  the  application  of  the 
Harten-Yee  TVD  sdieme  to  the  solution  of  those  equations.  Chapter  3  focuses  on  the  specifics 
of  the  code  development  and  the  driving  factors  motivating  the  choice  of  the  domain-decompo¬ 
sition  technique  used  in  this  project.  The  numerical  results  obtained  from  the  PANS-3E^ 
code  for  the  sharp-nosed  missile  are  presented  and  discussed  in  Chapter  4  as  are  results  for 
performance  of  the  algorithm  on  a  variety  of  computing  platforms.  The  appendices  contain 
supplementary  information  on  both  the  numerical  and  code  development  aspects  of  this  inves¬ 
tigation.  Appendix  A  contains  a  derivation  of  the  nondimensional  equations  which  govern  the 
fluid  bdmvior  in  the  high-speed  fliid^t  regime.  Appendices  B  and  C  provide  additional  infor- 
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mation  on  important  parallelization  issues  and  document  the  changes  made  to  Moran’s  code 
in  the  developmmt  of  the  PANS-3EM  code.  Finally,  the  source-code  listings  for  the  key 
domain-decomposition  subroutines  are  found  in  Am>endiz  D. 
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IL  Numerical  Algorithm 


11118  chapter  presents  the  equations  and  boundary  conditions  which  govern  the  phys¬ 
ics  of  the  flow  field  arwind  the  high-speed  missile.  With  the  equations  suitably  expressed,  an 
integral  formulation  is  obtained  which  lends  itself  to  solution  via  a  finite-volume  teduuque.  A 
discussion  of  the  Karten-Yee  TVD  algorithm  whidi  is  used  in  copjuncdon  with  the  finite-vol¬ 
ume  technique  follows.  Finally,  the  features  of  the  algorithm  which  allow  for  parallelism  are 
presented. 

2.1  Equation  Development 

2.1.1  Governing  Equations 

Ihe  Navier-Stokes  equations  are  statements  of  conservation  laws  of  mass,  momentum 
and  energy.  In  conservative  vector  form  for  a  three-dimmsional  Cartesian  coordinate  system, 
they  can  be  stated  as  [2] 


where 


U  = 


P 

pu 

pv 

pw! 


£  = 


pa 

PM^  +  P-T 


XX 


p«v-x 


xy 


p«w-x 


xz 


L(^f-^P)«-Xxx«-XxyV-X„w  +  ^^J 


(2) 


(3) 


(4) 


10 


pv 

PMV-  I 


F  = 


yjt 


pv  +p-T 


yy 


pvw  -  T 


yi 


(E,  +  p)v-x  u-x  V - 

V  f  yx  'vv 


yy 


X 

yz 


«yj 


pw 


G  = 


pvw-x^ 
pw^  +  p-x^^ 

(E,+p)  W-X^u-  x^v  -  x^^w  + 


2  2  2 
„  ,  u  +v  +w  . 

Et  =  P<«+ - 9 - > 


and 


Qx  =  (-Kr)^ 

«y  =  <-’'^>y 


(5) 
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(9) 

(10) 

(11) 


The  conservative  form  of  the  equations  is  used  since  the  numerical  solution  to  these 
squi^tions  is  expected  to  contain  shocks  and  other  structures  characterized  by  abrupt  flow 
property  changes  [2,  pg  51].  The  vector  U  contains  the  conserved  variables  while  the  vectors 
E,  F,  zmd  G  are  the  flux  vectors. 

After  nondimensionalization^,  equation  2  becomes 


+£/.  +F*,  +C*.  =0 


(12) 


1.  See  Appendix  A  for  a  description  of  the  nondimensionalizadon  procedure. 
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Q*  -  T* 

'•V  *  V 

(20) 

g*  =  - 

(21) 

(Y-  DMiRePr  ^ 

Equation  12  represents  a  system  of  five  equations  and  seven  unknowns.  Thus,  two 
additional  equations  are  required  to  provide  closure.  These  are  obtained  from  the  assump¬ 
tions  of  a  calorically  and  thermally  perfect  gas  and  yield  the  algebraic  relations 

p*  =  (Y-l)p*e*  (22) 


and 


(23) 


Since  the  form  of  the  nondimensional  equations  is  identical  to  that  of  the  dimensional 
equations,  the  superscript  is  dropped  for  convenience  with  the  realization  that  the  param¬ 
eters  Reynolds  number,  Mach  nuxnber,  and  Prandtl  number  make  their  appearance  only  in 
the  nondimensional  case. 


2.1.2  Finite  Volume  Formulation 

Equation  12  can  be  integrated  over  an  arbitrary  volume  to  yield 

J  U^V+  j  (£^  +  fy  +  Cp  dV=0  (24) 

Application  of  the  Divergence  Iheorem  to  the  second  integral  gives 

^U,dV+^H  hdS^Q  (25) 

V  s 

where  HmE!i+  F)  +  Gk  and  5  r^resents  the  surface  boimding  the  volume,  V. 
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Figura  4:  Finite  NtelimM  Etemant 


If  the  grid  is  fixed  in  time,  then 

^Jl/dV+^H  /ldS  =  0  (26) 

V  5 

In  a  finite-volume  methodology,  equation  26  is  applied  to  eadi  volumetric  element  A 
typical  such  element  is  depicted  in  Figure  4.  Calculation  of  the  fluxes  through  faces  of  this  ele¬ 
ment  proceeds  by  examining  the  shaded  face.  'Hie  area  of  this  face  can  be  calculated  by  com¬ 
puting  the  cross  product  of  the  face  diagonals,  £31  and  R24, 

=  l(«3ix«24)  (27) 

where 

*31  =  -  Xi)  J  +  (y3  -  yx)j  +  (zj  -  Zi)  k  (28) 

*24  =  (-*2  -  •*4)  « +  (^2  “  +  (^2  “  ^4^  ^ 

Numerical  svdbscripts  on  the  surface  vectors,  hdS,  designate  the  element  face  while  munerical 


14 


siibscripts  on  the  position  vectors,  R,  designate  the  element  vertices. 

The  x-component  of  the  numerical  flux,  f,  through  face  1  is  thus  given  by 

/lj^  =  £(/ld5,  /)  (30) 

where  the  numerical  subscript  again  designated  the  element  face  and  the  subscript  x  desig¬ 
nates  the  flux  direction.  Similarly,  the  fluxes  in  the  y  and  x-directions  through  this  face  are 


=  F(hdS^  j) 

(31) 

=  G(ftdS^  k) 

(32) 

Repeating  this  process  for  the  remaining  five  faces  of  the  volumetric  element  yields  a 
total  of  eighteen  projected  area  terms  multiplied  by  an  appropriate  flux  component.  In  prac¬ 
tice,  it  is  not  necessary  to  compute  all  terms  for  each  cell,  since  adjacent  cells  share  common 
faces.  In  fact,  values  on  three  of  the  six  faces  are  computed  which  yields  nine  flux  components 
computed  for  each  cell. 

According  to  \^okur  [43],  the  volume  for  a  general  hexahedral  cell  can  be  computed 
as 

V=^(fldSi+  hdS2  +  hdS^)  Rj i  (33) 

Vinokur  also  notes  that  the  quantity  V  is  equal  to  the  Jacobian,  J. 

With  the  cell  volume  determined,  a  first-order-accurate  discretization  of  the  temporal 
term  and  a  finite  volume  discretization  of  the  surface  integral  of  equation  26  gives 

-  T  X  m 

m  =  I 

where  yk  designates  the  volume  element  and  m  is  the  index  of  summation  over  the  faces  of 
the  element.  In  obtaining  equation  34,  it  has  been  assumed  that  the  flow  properties  are  con¬ 
stant  over  the  interior  of  the  volumetric  element. 

Wang  [45]  notes  that  solution  to  equation  34  yields  the  flow  variables  at  the  cell  cen- 


15 


ngur*  S:  Gtioat  C^ls  Along  Stagnation  Lina 

ten  at  time  level  n+i.  This,  however,  requires  the  determination  of  the  fluxes  at  the  cell  faces. 
The  method  used  in  this  study  to  compute  the  fluxes  is  the  Hartmi-Yee  TVD  scheme  which  is 
discussed  in  Section  2.2. 

2.1.3  Boundary  Conditions 

Boundary  condition  implementation  is  straightforward.  Since  the  flow  is  supenonic  at 
the  inlet  and  outlet  of  the  domain,  freestream  conditions  are  enforced  at  the  inlet  and  a  super¬ 
sonic  *^o  change”  boundary  condition  is  specified  at  the  outflow  plane.  Ihe  no-slip  condition  is 
enforced  on  the  missile  body  for  solutions  computed  with  the  Navier-Stokes  equations  while 
Euler  solutions  use  the  impermeability  and  flow  tangen<y  conditions  at  the  surface.  The 
boundary  condition  on  pressure  is  specified  as  a  zero  normal  gradient,  and  the  wall  tempera¬ 
ture  is  specified  at  a  constant  value.  These  boundary  conditions  are  consistent  with  those  uti¬ 
lized  by  Moran  [271. 

Specification  of  zero  flux  along  the  stagnation  line  is  handled  via  a  reflection  condition 
through  the  use  cX  ghost  cells.  In  the  case  of  zero  angle  of  attack,  circumferential  and  tangen¬ 
tial  velocity  components  are  reflected  across  the  stagnation  line  while  pressure,  density  and 
axial  velocity  values  are  set  equal.  At  angle  of  attack,  an  extrapolation  technique  is  used.  A 
sample  set  of  g^ost  cells  situated  along  the  stagnation  line  appears  in  Figure  5. 
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2.2  Harteti'Yee  TVD  scheme 

Harten  [16}  proposed  that  the  invisdd  flux  components  of  equation  2  could  be  modi¬ 
fied  to  3deld 


where  E,  F,  and  G  represent  the  modified  inviscid  flux  vectors,  R,  S,  and  T  represent  the  vis¬ 
cous  flux  vectors,  and 


A.  = 


A/ 


(36) 


Here,  V  is  taken  as  the  volume  of  the  finite  volume  element.  The  modified  inviscid  flux  vectors 
are  defined  as 


E 


2*^*  ^ 


1 

2 


i+2’jk 


N 

> 


f  > 

1 

^2 

f  ^ 

(V  , 

J 

^  i+2'Jk  / 

1  P  1 


(37) 


where  the  matrix  R  contains  the  eigenvectors  associated  with  the  modified  wave  speeds,  and 

the  term  P  “acts  to  limit  the  characteristic  variable,  thereby  providing  higher  accuracy  [27].” 

The  terms  4  >  ^  >  and  ^  are  HiB  geometric  terms  of  the  finite  volume  formulation  and  repre- 

^  y  ^ 

sent  the  projected  areas  of  the  element  face  into  the  x,  y,  and  z  coordinate  planes. 

Equation  37  calculates  fluxes  at  a  cell  face  by  using  information  available  at  the  cell 
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centers.  This  is  known  as  a  non-MUSCL  approach.  In  contrast,  a  MUSCL  approach  calculates 
fluxes  at  cell  faces  by  extrapolating  informaticn  from  the  cell  centers  and  evaluating  flux  val¬ 
ues  appropriately.  Tlie  reader  should  note  that  the  invisdd  fluxes  of  equation  37  are  upwind 
differmced,  while  the  viscous  components  of  equation  35  are  centrally  differenced. 

2.3  Inherent  Parallelism 

Parallelism  can  generally  be  achieved  when  the  following  conditions  are  met  [35,42]: 

•  llie  code  works  correctly  in  a  serial  implementation 

•  Hie  code  is  limited  by  the  Central  Processing  Unit  (CPU)  and  not  by  other  factors 
such  as  Input/Output  (I/O),  bus  bandwidth,  etc. 

•  The  code  contains  portions  which  are  independent  of  one  another  (no  control 
dependence) 

•  The  code  portions  which  are  independent  do  not  attempt  to  write  calculations  into 
the  same  ou^ut  variables  (no  data  dependence) 

The  first,  second,  and  fourth  of  these  issues  can  be  easily  verified  by  examining  the 
code  to  be  modified.  The  third  issue  requires  a  more  fundamental  examination  of  the  code’s 
underlying  algorithm.  For  the  case  of  the  TVD  scheme  used  in  this  investigation,  the  compu¬ 
tational  atencU  is  depicted  in  Figure  6.  This  stencil  is  comprised  of  those  node  points  that 
enter  into  the  flow  variable  calculations  at  a  givoi  node  point.  Off-axes  points  in  the  figure  are 
bound  by  the  shaded  box  and  are  represented  by  the  unfilled  drdes.  These  stencil  points  orig¬ 
inate  firom  the  central  differencing  of  the  viscous  terms  in  equation  35.  Darkened  circles  rep¬ 
resent  those  points  which  originate  from  the  flux  calculations  of  the  Harten-Yee  TVD  scheme. 
These  points  can  be  decoupled  through  the  use  of  approximate  factorization  [2]  which  results 
in  the  three  separate  one-dimensional  stencils  shown  in  Figure  7.  These  stencils  are  applied 
separately  to  the  node  points  in  a  series  of  sweeps,  one  such  sweep  along  eadi  coordinate 
direction.  Upon  completion  of  the  third  sweep,  the  solution  has  advanced  to  the  next  time 
level.  Examination  of  equation  34  reveals  that  the  scheme  is  explicit  in  that  calculations  at  a 
given  node  point  use  only  known  information  firom  a  previous  time  step.  This  explicit  nature 
allows  for  a  complete  decoupling  of  the  calculations  at  each  node  point.  In  theory,  givai 
enou^  processors,  all  node  values  can  be  calculated  simultaneously.  In  practice,  given  the 
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Hgtir*  6:  Coinputational  Staneii 

extremely  large  number  of  node  points  in  a  typical  computational  mesh,  calculaticms  are  con¬ 
ducted  in  parallel  over  lines  of  grid  points.  This  technique  is  especially  amenable  to  imple¬ 
mentation  on  a  vector  processor  since  identical  calculations  are  performed  at  each  node.  This 
investigation  develops  an  algorithm  in  whidi  the  calculations  are  performed  in  parallel  over 
blocks  of  grid  points.  The  utility  of  such  a  scheme  is  discussed  in  Chapter  3. 


Figure  7:  Decoupled  Sweepe 
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III.  Domain  Decomposition 


\ 


71ii8  section  discusses  the  domain-decomposition  approach  used  in  this  project  Issues 
affecting  the  decomposition  approach  such  as  load  balancing  and  inter-domain  communica¬ 
tion  are  identified.  General  features  of  PANS-3EM  that  are  of  importance  to  future  modifica¬ 
tion  efforts  are  explained. 

3.1  B(Kkground 

As  mentioned  previously,  the  method  of  domain  decomposition  can  take  one  of  several 
forms.  Since  the  computer  code  to  be  parallelized  was  already  complete,  validated,  and  in  use, 
the  chosen  method  of  domain  decomposition  required  easy  incorporation  into  the  firamework 
of  the  existing  code.  Furthermore,  since  the  target  machine  of  implementation  was  not  fixed, 
it  was  desired  to  develop  a  decomposition  approach  whidi  was  general  enough  to  run  on  a 
variety  of  computing  platforms  with  few  modifications.  Figure  8  shows  a  sketdb  of  the  objec¬ 
tive  of  the  PANS-3EM  development  process.  PANS-3EM  provides  a  stepping-off  point  for  a 
subsequent  modification  or  port  to  either  a  distributed-  or  shared-memory  computing  plat¬ 
form.  As  discussed  in  the  Section  1.1,  implementation  on  these  two  architectures  requires 
very  different  programming  approaches,  and  thus  a  code  that  can  be  modified  for  use  on 
either  platform  provides  a  potential  benefit  in  terms  of  reduced  code  development  time. 

A  block-structured  domain-decomposition  approadi  provides  for  ready  implementa¬ 
tion  on  either  platform.  Furthermore,  the  size  and  number  of  blocks  can  be  readily  adapted  to 
fit  within  the  memory  limitations  on  a  given  machine.  This  is  especially  useful  in  the  case  of  a 
distributed-memory  machine  in  which  the  data  must  be  rigidly  partitioned  among  the  avail¬ 
able  processor  memories. 

When  implementing  a  block-structured  algorithm,  two  key  areas  must  be  addressed: 


load  balance  and  inter-block  commimication. 


FIgm  •:  Cod*  OttvalopiiMfit  Modal 

3.1.1  Load  Balance 

In  a  paraUal  coda  asaeution,  tha  run  time  of  the  program  can  be  no  faster  than  that  of 
the  slowest  process.  Thus,  in  order  to  adbieve  optimal  turnaround  time^,  the  work  load  must 
be  divided  among  the  processors  such  that  all  processors  share  as  near  equal  portion  of  the 
computational  task  as  possible.  For  the  TVD  scheme  used  in  this  investigation,  the  amount  of 
computational  effort  required  at  each  node  is  approximately  equal.  Thus,  in  order  to  achieve  a 
proper  load  balance,  the  number  of  nodes  assigned  to  eadi  subdomain  must  be  approximately 
equal.  Given  the  most  general  block  decomposition  scenario,  it  may  be  possible  to  partition 
the  domain  such  that  each  block  contains  very  nearly  the  same  number  of  node  points.  How¬ 
ever,  in  order  to  speed  development  of  PANS-3EM,  the  decomposition  process  was  performed 
such  that  no  block  was  allowed  to  have  more  than  a  single  nei^boring  blodc  along  a  given 
blodc  face.  This  restriction  forces  blodc  boundaries  to  be  aligned  and  also  implicitly  requires 
nei^boring  blodc  faces  to  contain  the  same  number  of  computational  node  points.  Valid  and 
invalid  decomposition  examples  are  given  in  Figure  9.  The  reader  should  note  that  the  shaded 

1.  The  author’s  definition  of  turnaround  time  is  the  physical  time  elapsed  between  submission  of 
the  computing  job  and  retrieval  of  the  cmnputed  solution. 
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Rgur*  9:  Oomain  DaeomposMon  ExamplM 


face  in  the  invalid  decomposition  sketch  borders  on  two  block  faces  which  violates  the  stated 
restriction.  This  restriction  is  not  unique  to  this  project;  Hauser  also  implements  a  similar 
restriction  during  domain  decomposition  [17]. 

3.1.2  Inter-block  communication 

A  division  of  the  computational  domain  into  subdomains  or  blocks  necessitates  the 
communication  of  information  across  adjoining  subdomain  faces.  This  communication  is  part 
of  the  parallelization  overhead  and  detracts  firom  the  performance  of  the  parallel  algorithm 
since  processors  must  expend  clock  cycles  either  communicating  or  waiting  to  c*  mmunicate 
instead  of  carrying  out  the  desired  calculations.  For  the  TVD  sdieme  used  in  this  project,  the 
two  factors  affecting  the  amount  of  information  whidi  must  be  communicated  across  the 
blodts  are  the  logical  size  of  the  block  face  in  terms  of  number  of  nodes  and  the  size  of  the 
computational  stencil.  These  two  factors  are  illustrated  in  Figure  10.  Since  the  computational 
stencil  is  five-points  wide  along  any  of  the  coordinate  directions,  points  occupying  the  points 
either  on  a  block  boundary  or  immediately  adjacent  thereto  must  necessarily  use  information 
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Figura  10:  inta^doiMln  Comiminiealion  llod«l 


from  nodes  within  an  adjacent  block.  While  the  size  of  the  blod:  face  depends  on  the  logical 
partitioning  of  the  computational  domain,  the  size  ct  the  computational  stencil  is  fixed  due  to 
the  nature  of  the  TVD  scheme.  From  the  figure,  it  is  obvious  that  the  number  of  node  points 
along  any  fiioe  which  require  information  from  an  adjacent  block  is  n  z  m  z  t. 

While  relatively  unimportant  in  a  shared-memory  implementation,  the  number  of 
nodes  occupying  a  face  of  a  block  becomes  very  important  in  a  distributed-memory  implemen¬ 
tation.  This  is  due  to  the  fact  that  all  across-blodc  references  must  be  handled  through  the 
explicit  use  of  message  passing.  Reference  [21]  provides  an  excellent  discussion  on  message 
passing  for  a  distributed-memory  architecture.  As  pointed  out  by  Braaten  [5],  "the  key  factor 
for  minimizing  communication  costs  is  to  maximize  the  ratio  of  computation  in  each  processor 
to  the  communication  between  processors.”  Braaten  also  notes  that  the  computational  work 
within  a  block  is  proportional  to  the  number  of  nodes  in  a  block  while  the  communication 
work  is  proportional  to  the  number  of  nodes  on  block  boundaries.  It  is  therefore  desirable  to 
maximize  the  internal  node  to  boundary  node  ratio.  Since  no  communication  occurs  on  block 
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Hgura  11:  ComputatiofMl  Block  and  Buffar  Anaya 


boundaries  whidi  are  part  of  the  physical  problem  (i.e.,  those  boundaries  which  are  handled 
throufl^  the  mathematical  boundary  conditions  of  the  problem),  these  boundaries  should  not 
be  considered  in  the  ratio  of  internal  to  facial  nodes  within  a  block. 

The  key  data  structure  normally  used  [5,17,46]  when  implementing  the  interblodc 
communication  model  is  a  set  of  buffer  arrays  are  of  dimension  nxmxt  where  n  and  m 

are  the  dimensions  of  a  block  face  and  t  is  the  number  of  points  from  whidi  information  must 
be  communicated  across  blodi  faces.  The  dimoisions  crudal  to  the  interblodc  communication 
appear  in  Figure  10  while  Figure  11  shows  a  general  computational  block  surrounded  by  six 
buffer  arrays. 

While  the  buffer  arrays  are  generally  quite  effident  in  terms  of  computational  speed, 
it  is  obvious  that  there  is  a  storage  penalty  levied  by  their  use.  For  example,  a  single  buffer 
array  for  an  82-node<by-35-node  block  face  and  a  five  point  computational  stendl  requires 
5740  storage  elements.  Using  buffer  arrays  in  the  domain  decomposition  diosen  for  this 
investigation  would  require  approximately  22  megabytes  of  storage  which  translates  to  a  19% 
additional  memory  requirement  above  the  serial  version  of  the  computer  code.  Because  the 
serial  version  of  the  code  requires  approximately  115  megabytes  of  storage  and  since  nearly 
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all  locally-available  computing  platforms  possess  a  TnaTimnm  of  128  megabytes  of  storage 
capacity,  the  use  of  buffer  arrays  in  the  parv.ilelization  process  was  deemed  unacceptable. 
Instead,  a  block  table  was  constructed  whidi  managed  the  flow  of  information  between  the 
blocks  with  no  storage  overhead.  This  data  structure  is  discussed  more  fiilly  in  Appendix  B. 

3.2  Selection  of  Computational  Subdomains 

PANS-3EM  is  designed  to  allow  for  an  arbitrary  number  of  computational  blocks  with 
a  decomposition  performed  along  any  or  all  of  the  computational  coordinate  directions.  How¬ 
ever,  as  mentioned  in  Section  3.1.1,  for  this  investigation,  decomposition  was  performed  along 
a  single  coordinate  direction.  The  choice  of  the  axis  of  decomposition  was  motivated  by  two 
primary  factors:  ease  of  incorporation  into  the  existing  code  framework,  and  inter-block  com¬ 
munication. 

The  computational  domain  used  in  this  study  consisted  of  62  x  82  x  35  nodes  in  the 
Tf,  and  ^  coordinate  directions,  respectively.  The  coordinate  system  is  depicted  in  Figure  12.  In 
this  case,  computational  blo<k8  with  a  minimum  number  of  facial  nodes  would  be  generated 
by  a  decomposition  along  the  if  coordinate  axis.  This  decomposition  would  generate  compute- 


tional  blodcs  with  62  x  35  node  blodi  faces  across  which  information  would  be  exchanged  with 
neighboring  blocks.  However,  it  was  anticipated  that  the  addition  of  a  base  flow  r^on  would 
extend  the  number  of  points  in  the  ^  direction  and  greatly  complicate  sudi  a  decomposition. 
Consequently,  a  decomposition  along  the  %  coordinate  direction  was  chosoi  since  it  provided 
for  the  optimal  combination  of  block  facial  nodes  and  integration  into  the  existing  computer 
code.  Decomposition  in  this  manner  resulted  in  2,870  elements  along  a  face  for  which  informa¬ 
tion  must  be  exchanged.  Had  the  decomposition  occurred  along  the  C  coordinate  direction, 
then  5,084  elements  would  have  appeared  along  a  face  thus  resulting  in  a  77%  greater  com¬ 
munication  requirement.  Again,  this  communication  penalty  does  not  occur  on  a  shared-mem¬ 
ory  implementation.  Decomposition  along  two  or  three  axes  would  increase  communication 
requirements  even  further  since  additional  nonphysical  boundaries  would  be  created. 

Four  computational  domains  were  used  throughout  the  course  of  this  investigation. 
This  number  was  deemed  an  acceptable  match  for  the  computing  madiines  locally  available 
which  typically  consist  of  between  two  and  four  processing  units.  This  decomposition  resulted 
in  three  of  the  four  domains  containing  45,920  node  points  while  the  fourth  domain  contained 
40,180  node  points.  While  load  balancing  considerations  would  dictate  an  identical  number  of 
nodes  in  each  domain,  the  disparity  was  necessitated  by  the  restrictions  stated  previously.  No 
disparity  in  number  of  node  points  occurs  only  for  the  condition 

n  mod  m  =  0  (38) 

where  n  is  the  number  of  node  points  along  the  axis  of  decomposition  and  m  is  the  number  of 
Ixodes  along  that  ads. 

3.3  Implementation  of  Domain  Decomposition 

With  the  structure  of  the  computational  blocks  determined,  generation  of  the  blocks 
proceeded  by  first  blocking  a  single  two-dimensional  grid  plane,  and  then  rotating  the  blocked 
grid  plane  around  the  missile’s  ads  of  symmetry.  Figure  13  shows  a  blodced  two-dimensional 
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Figura  13: 1\iiio4)inMnslonai,  Btocind  Grid 


grid  plane  wUle  Figure  14  depicts  the  inner  and  outer  hull  of  a  three  diin«iniiii>fni1  grid.  The 
gaps  in  the  grids  represent  the  dividing  lines  between  the  computational  blo^s.  Although  the 
domains  are  of  very  different  size  in  the  physical  space,  they  contain  nearly  the  same  number 
of  computational  nodes.  The  grids  pictured  in  Figures  13  and  14  are  for  illustration  of  the 
domain-decomposition  process  and  do  not  represent  the  actual  grids  used  to  study  the  sharp¬ 
nosed  missile. 

Implementation  of  the  domain  decomposition  was  performed  by  nriliring  a  four¬ 
dimensional  array  construct.  With  this  implementation,  a  node  in  computational  coordinate 
space,  located  with  the  ordered  triple  {i,j,k),haB  a  one-to-one  mapping  to  a  node  in  the  blodc 
coordinate  space,  located  with  the  order  quadruple  {i^j'k>L).  Since  it  is  very  impractical  for  a 
programmer  or  user  to  specify  coordinates  in  blodt  space,  the  program  takes  as  input  compu- 
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tational  coordinates  and  converts  them  to  block  coordinates.  For  the  four  domain  decomposi¬ 
tion  used  in  this  project,  the  mapping  from  computational  to  block  coordinates  takes  the  form 


L  = 


I  I  +  ighost 
L  idim 


+  l 


(39) 


i'  =  i  + ighost-  (L-l)idim 


(40) 
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r  =  i 

(41) 

k'  =  k 

(42) 

where  igkost  is  the  number  of  ghost  cells  in  the  {-coordinate  direction  and  idim  is  the  number 
of  points  in  each  block  along  the  (-coordinate  direction.  The  inverse  transformation  is  given 
simply  by  solving  equation  40  for  the  computational  coordinate,  i. 

The  proper  handling  of  computaticnal  and  block  coordinates  is  fundamental  to  achiev¬ 
ing  a  working  computer  code.  The  issues  concerning  the  coordinate  transformations  are  dis¬ 
cussed  fully  in  Appendix  B. 
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IV.  Results 


Results  for  this  investigation’s  two  nuyor  areas  of  focus-algorithm  development  and 
aerodynamic  analysis-are  presented.  Validation  and  performance  results  for  the  PANS-3EM 
code  are  provided.  Additional  results  which  include  lessons  learned  from  the  algorithm  devel¬ 
opment  process  and  a  road-map  for  future  parallelization  efforts  appear  in  ^pendices  B,  C, 
and  D.  In  the  aerodynamic  arena,  computed  results  for  drag  and  pitching  moment  derivatives 
are  compared  to  other  numerical  as  well  as  experimental  results.  Underlying  study  principles 
such  as  computational  grid  resolution  and  fin  modeling  issues  are  discussed.  Also  included  are 
the  results  of  the  turbulence  model  on  the  solution  behavior. 


4.1  Parallel  Program  Validation 

Validation  of  PANS-3EM  was  accomplished  by  comparing  results  to  the  serial  version 
of  the  code.  With  both  codes  run  in  serial  mode,  no  differences  were  expected  in  any  of  the 
computations.  A  total  of  four  separate  validation  runs  were  performed,  each  designed  to  test 
various  aspects  of  the  code.  A  summary  of  the  runs  and  the  test  objectives  appears  in  Table  1. 


Table  1:  PANS>3EM  Validation  Run  Sununary 


Run 

Number 

Description 

Objective 

1 

Euler  bormdary  conditions,  no 
angle  of  attack,  no  turbulence,  no 
fins 

Test  basic  code  including  metric  com¬ 
putations,  Euler  boundary  condition 
computation  routines,  and  sweep 
routines 

2 

Navier-Stokes  boundary  condi¬ 
tions,  no  angle  of  attack,  no  turbu¬ 
lence,  fins 

Test  Navier-Stokes  boundary  condi¬ 
tion  routines,  velocity  and  tempera¬ 
ture  gradient  computation  routines, 
and  fin-handling  loops 

3 

Navier-Stokes  botmdary  condi¬ 
tions,  angle-of-attack,  no  turbu¬ 
lence,  fins 

Test  angle-of-attack-related  data 
structures 

4 

Navier-Stokes  boundary  condi¬ 
tions,  angle-of-attack,  turbulence, 
fins 

Test  turbulence  routine 

Hie  validation  runs  were  performed  using  a  62  x  82  x  35  computational  grid.  With  the 
exception  of  run  number  1  in  which  the  flow-field  was  initialized  to  free  stream  conditions,  all 
runs  were  started  from  a  partially-developed  solution.  Tliis  facilitated  a  more  thorough  test  of 
the  code  since  errors  in  gradient  calculations  were  difficult  to  detect  for  a  relatively  uniform 
fiowfield.  Validation  runs  were  performed  using  100  iterations.  Norm  histories  for  validation 
runs  1  and  4  are  depicted  in  Figures  15  and  16.  Data  for  runs  2  and  3  are  not  presented  since 
their  behavior  is  similar  to  run  4.  Tabular  data  for  selected  iterations  also  appears  in  each  fig¬ 
ure  in  order  to  verify  that  the  computed  norms  are  in  fact  idmtical. 

In  addition  to  the  norm  results,  comparisons  were  performed  on  the  conserved  vari¬ 
ables  output  from  each  code.  The  FORTRAN  output  format  specification  utilized  was  el8.12. 
With  a  62  X  82  X  35  computational  grid,  each  output  file  contained  approximately  18  mega¬ 
bytes  of  numerical  data.  Files  were  then  compared  using  the  UNIX  di/f  command.  In  all  cases, 
all  computed  values  were  identical. 

4.2  PANS-3EM  Performance  Testing 

The  performance  of  PANS-3EM  was  assessed  through  comparative  timing  runs  on 
several  shared-memory  platforms.  Table  2  contains  a  summary  of  the  platforms  used  along 
with  brief  descriptions  of  the  relevant  hardware  features  of  each  platform. 

Table  2:  Code  Test  Platforms 


Platform 

Main 

Memory 

(Mbytes) 

Number 

of 

Processors 

Processor 

Type 

Parallel  Implementation 

Sun  4/690 

600 

4 

Scalar 

none 

Digital  Equipment 
DEC  4000 

360 

2 

Scalar 

none 

Silicon  Graphics 
Power  Series  4D/ 
480 

128 

8 

Scalar 

Power  FORTRAN 
Accelerator 

Convex  C220 

128 

2 

Vector 

Compiler  Optimization 

31 
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Flgiir*  1 5:  Cod»  yhMMon  for  T»st  Cas*  1 


4.2.1  Sun  4 1690 

Althou^  this  madiir';  has  four  processors,  no  parallelization  tools  or  parallelizing 
compiler  directives  were  available.  Timing  runs  were  conducted  on  this  machine  due  to  the 
large  amount  of  main  memory  which  eliminated  any  memory  paging  dfects  that  could  bias 
the  timing  results. 

4.2.2  Digital  Equipment  DEC  4000 

Version  6.0  of  the  Digital  Equipment  (DEC)  FORTRAN  compiler  was  incapable  of  gen¬ 
erating  olgect  code  for  a  parallel  implementation.  This  situation  is  currently  being  addressed 
by  DEC,  and  when  the  problem  is  resolved,  PANS-3EM  will  be  implemented  in  parallel  on 
this  machine.  Timing  results  for  serial  execution  are  presented  for  future  comparative  pur¬ 
poses. 


4.2.3  Silicon  Graphics  4DI480 

Of  the  platforms  tested,  the  Siliom  Graphics  FORTRAN  compiler  provides  the  great¬ 
est  flexibility  in  the  parallelization  process.  Power  FORTRAN  Accelerator  (PFA),  a  Silicon 
Graphics  code  optimizing  preprocessor,  perforxns  loop-level  optimizations  on  source  code  files. 
More  explicit  control  is  available  throuid^  the  use  of  operating-system-level  function  calls  and 
compiler  directives.  These  tools  provide  the  option  of  a  more  complete  domain  decoupling  in 
the  PANS-3EM  code  at  the  expense  of  a  much  greater  degree  of  programming  effort.  Timing 
runs  presented  here  were  obtained  with  the  use  of  PFA. 

4.2.4  Convex  C220 

This  madiine  was  the  only  vector  machine  tested.  The  performance  penalty  associ¬ 
ated  with  the  PANS-3EM  code  was  expected  to  be  especially  harsh  since  the  modifications 
required  in  order  to  implement  the  domain  decomposition  severely  reduced  the  code  vectoriz- 
ability.  No  explicit  parallelizing  tools  were  available  for  this  machine  and  consequently  a  com¬ 
piler  directive  was  used  to  accomplish  the  parallelism. 
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4.3  Performance  Data 

In  all  performance  tests,  both  versions  of  the  code  were  compiled  with  identical  com¬ 
piler  flags.  For  the  Silicon  Graphics  and  Convex  runs,  the  performance  times  are  thus  indica¬ 
tive  of  the  d<^ee  of  efficiency  with  which  the  respective  compilers  are  able  to  parallelize  each 
version  of  the  code. 

Table  3  contains  the  results  of  the  comparative  performance  testing  of  PANS-3EM  and 
the  serial  code.  As  expected,  in  all  cases  the  PANS-3EM  code  is  slower  than  the  original  code. 
Since  no  parallelism  or  optimization  was  employed  on  the  Sun  or  Digital  E<iuipment 
madiines,  the  performance  difference  is  due  exclusively  to  the  additional  code  required  to 
implement  the  domain  decomposition.  The  DEC  madiine  relies  heavily  on  data  and  instruc¬ 
tion  caching  to  adiieve  maximum  performance;  the  greater  reduction  in  the  performance 
numbers  on  this  madiine  indicates  that  the  new  code  does  not  lend  itself  to  the  employed 
caching  algorithms  as  well  as  does  the  serial  code.  An  examination  of  the  coding  methodology 
as  described  in  Appendices  B  and  D  reveals  that  the  domain-decomposition  algorithm  does 
not  affect  the  order  of  complexity  of  the  serial  code.  However,  there  is  an  increase  in  operation 
count  that  is  reflected  in  the  performance  numbers. 


Ihble  3:  Pcifornuuice  Metrics 


Platform 

Processors 

Used/ 

Available 

Serial 
Code  CPU 
llme(sec) 

PANS- 
3EMCPU 
lime  (sec) 

Percent 

Difference 

Sun 

1/4 

2293.9 

2347.5 

-2.3 

Digital  Equipment 

1/2 

482.0 

533.0 

-10.6 

Silicon  Graphics 

S/8 

2926.8 

3372.7 

-15.2 

Convex 

2/2 

320.6 

420.6 

-31.2 

The  15  percent  performance  reduction  of  PANS-3EM  over  the  serial  version  of  the 
code  for  the  Silicon  Graphics  implementation  is  composed  of  two  parts:  one  part  correspond¬ 
ing  to  the  extra  operation  coimt  and  another  part  corresponding  to  the  less  efficient  loop-level 


35 


paraUeUnii  implomented  by  the  compiler.  Ibe  four-dimeneional  array  constructs  used  in  the 
domain-decomposition  approadi  require  an  additicmal  looping  structure  in  the  computer  code. 
Ihe  presence  of  this  additional  nested  loop  coupled  with  the  necessary  across-blodt-referwnce 
diecks  results  in  an  added  level  of  complexly  which  is  not  easily  handled  by  the  parallelixing 
compiler. 

The  performance  reduction  trend  is  carried  even  further  in  the  case  of  the  Convex. 
This  is  directly  attributable  to  the  reduced  degree  of  vectorizability  of  the  lo<q>ing  structures 
due  to  the  domain-decomposition  process.  The  issue  of  vectorizability  is  fully  discussed  in 
Appendix  B,  Section  B.4.  The  Convex,  a  shared-memory  vector  ardritecture,  was  not  a  target 
madiine  of  implementation,  and  consequently  the  domain-decomposition  approach  was  not 
optimized  for  this  machine.  The  reader  should  note  that  vastly  different  performance  num¬ 
bers  are  expected  for  a  distributed  memory  vector  processing  computer  sudi  as  the  Intel  Par¬ 
agon  or  a  distributed  memory  scalar  envircmment  sudi  as  a  workstation  farm  since  these 
implemaotations  would  require  a  conqtlete  decoupling  (tf  the  computational  domains  whidh 
would  reduce  the  degree  of  loop  nesting  and  eliminate  the  reduction  in  code  vectorizability 
inherent  with  the  current  version  of  PANS-3EM. 

As  mentioned  previously,  the  reduction  in  performance  numbers  for  PANS-3EM  were 
expected  due  to  the  additional  code  required  to  implement  tibe  computational  domain  decom¬ 
position.  This  is  a  necessary  precursor  to  a  distributed-memory  implementation.  The  sli^t 
reduction  in  performance  is  more  than  offset  by  the  increased  flexibility  the  PANS-3EM 
implementati<m  provides.  Furthermore,  given  sufficient  computatimial  tools,  a  complete 
decoupling  of  the  computational  domains  can  be  conducted  which  has  the  potential  of  greatly 
improving  performance.  The  issues  associated  with  the  domain  decoupling  are  discussed  in 
Appendix  B. 

4.4  Aerodynamic  Results 

Aerodynamic  results  presented  here  focus  on  the  computed  drag  and  pitching  moment 
derivative  coefficients  for  the  sharp-nosed  missile  and  the  aerodynamic  phenomena  believed 
to  have  the  greatest  impact  on  the  observed  results.  These  factors  include  the  stagnation 
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region  preseure  which  affects  the  computed  pressure  drag,  boundary  layer  resolution  which 
affects  the  skin-friction  drag,  and  flow-field  resolution  in  the  fin  area  whidi  affects  the  pres¬ 
sure  distribution  over  the  fins  and  consequently  greatly  impacts  the  pitdiing  moment  deriva¬ 
tive  results.  A  discussion  of  eadi  of  these  issues  as  well  as  a  mention  of  the  key  computational 
issues  and  study  methodology  appears  below. 

4.4.1  Computational  Issues 

The  computational  grid  used  in  this  investigation  consisted  of  62  z  82  x  35  node  points 
along  the  q,  and  (1  coordinate  directions,  respectively.  Minimum  wall  spacing  was  fixed  at 
0.0001.  The  grid  was  designed  to  closely  approximate  the  grid  used  in  the  work  of  Moran  [27], 
and  contains  nearly  the  maximum  number  of  node  points  that  can  be  used  on  a  128  megabyte 
computing  platform.  However,  certain  aspects  of  the  results  indicate  strongly  that  the  grid 
was  not  sufficiently  refined  in  order  to  resolve  all  relevant  flow  structures.  Since  a  machine 
with  larger  memory  ciq)acity  and  sufficient  computing  power  was  not  locally  available,  no  grid 
refinement  study  was  conducted.  However,  when  the  prdblems  associated  with  parallel  imple¬ 
mentation  on  the  DEC  4000  are  fully  resolved,  this  issue  should  be  explored  in  depth.^ 

Generation  of  the  grid  was  accomplished  using  GIUDGEN  [38],  an  elliptical  grid  gen¬ 
erator  developed  for  Wright  Laboratories.  Bisymmetry  of  the  flow-field  was  assumed  (no  mis¬ 
sile  yaw),  and  hence  the  computational  domain  extended  over  only  half  the  missile.  Points 
were  clustered  in  the  stagnation  and  fin  regions  as  well  as  in  the  boundary  layer.  Figure  17 
depicts  a  two-dimensional  macroscopic  view  of  the  grid  as  well  as  a  dose  view  of  the  grid  in 
the  stagnation  region.  Instead  of  a  perfectly  sharp  nose,  a  sli^t  rounding  was  incorporated  in 
order  to  more  closely  match  the  missile  sketch  and  shadowgram  of  Reference  [13]  and  to 
improve  the  solution  performance  in  the  nose  region.  Examination  of  the  axis  scales  on  the 
inset  graph  reveals  that  the  nose  radius  is  extremely  small-on  the  order  of  2%  of  the  body 
diameter.  The  three-dimensional  grid  was  formed  by  rotating  and  duplicating  the  two-dimen- 
1.  See  Section  4.2.2. 
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Figura  17:  D«MI  of  TViio^iiMnsional  Grid  Ptano 


sional  grid  about  the  missile’s  axis  of  symmetry. 

Convergence  criteria  for  the  computer  runs  was  established  by  Moran  [28]  based  on 
previous  two-  and  three-dimensional  work  performed  using  the  Hartm-Yee  TVD  algorithm. 
The  norm  for  convergence  calculations  was  defined  using  the  equation 


1 


where  n  is  summed  over  all  interior  nodes  in  the  computational  domain  and  AU  r^resents  a 
summation  over  the  dianges  in  values  of  the  components  of  the  conserved  variable  vector,  U 
between  successive  time  steps.  The  solution  was  deemed  converged  when  the  norm  value 
given  by  equation  43  reached  1x10*^.  Approximately  36  CPU  seconds  were  required  per  itera¬ 
tion  on  the  Convex  C220  with  a  typical  case  achieving  convergence  after  8000-10000  itera¬ 
tions. 


4.4.2  MethodoU^ 

Atotal  of  20  computer  runs  were  performed.  Freestream  Mach  number  was  varied 
between  two  and  six  in  order  to  establish  trends  over  the  entire  range  of  available  experimen¬ 
tal  data.  Drag  coefficients  were  calculated  for  zero  angle-of-attadi  missile  and  include  com¬ 
puted  contributions  due  to  skin  friction  and  pressure  as  well  as  an  empirically-obtained 
contribution  [12]  due  to  the  missile  base  region.  Pitching  moment  derivative  coefficients  were 
obtained  by  assuming  a  linear  variation  in  pitching  moment  over  the  small  (five  degree)  vari¬ 
ations  in  angle  of  attack. 

The  nature  of  the  numerical  algorithm  requires  an  initial  solution  with  a  partially 
developed  shock  and  botindary  layer  structure  [27].  Accordingly,  the  flow  field  was  initialized 
to  freestream  conditions,  and  100  iterations  were  performed  with  the  Reynolds  number  set 
aitifidally  high  in  order  to  effectively  negate  any  viscous  effects.  The  viscous  terms  were  then 
“turned  on”  and  the  associated  boundaiy  layer  allowed  to  partially  develop  over  several  hun¬ 
dred  more  iterations.  Finally,  the  turbulence  model  was  activated  and  the  solution  allowed  to 
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iterate  until  convergence.  A  summary  of  the  param^rs  for  each  computer  run  appears  in 
Table  4. 


Table  4;  Sharp-Nosed  Missile  Test  Matrix 


Run 

Mach 

Angle  of 

Turbulence 

Number 

Attack  (deg) 

Model  Active 

1 

2.0 

0.0 

2 

2.0 

0.0 

✓ 

3 

2.0 

5.0 

4 

2.0 

5.0 

✓ 

5 

3.0 

0.0 

6 

3.0 

0.0 

✓ 

7 

3.0 

5.0 

8 

3.0 

5.0 

✓ 

9 

3.5 

0.0 

10 

3.5 

0.0 

✓ 

11 

3.5 

5.0 

12 

3.5 

5.0 

✓ 

13 

4.5 

0.0 

14 

4.5 

0.0 

✓ 

15 

4.5 

5.0 

16 

4.5 

5.0 

✓ 

17 

6.0 

0.0 

18 

6.0 

0.0 

✓ 

19 

6.0 

5.0 

20 

6.0 

5.0 

✓ 

4.4.3  Drag  and  Pitching  Moment  Derivative  Coefficient  Results 
Computed  pitdiing  moment  derivative  results  appear  in  Figure  18.  The  negative  val¬ 
ues  of  pitching  moment  coefficient  indicate  that  a  restoring  moment  is  being  generated  by  the 
fins.  As  freestream  Mach  number  increases  throuf^  Mach  3.5,  the  fins  lose  effiectiveness  as  is 
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evident  by  the  decreaeingly  negative  values  of  the  pitching  moment  derivative  coefficient.  At 
approximately  Madi  3.75,  the  trends  in  the  experimental  and  computed  results  diverge,  with 
the  computed  results  continuing  in  a  decreasing  fashion  while  the  experimental  results 
reverse  their  trend.  The  reason  for  the  reversal  is  still  not  understood  and  is  the  subject  of 
continuing  study  [28,44].  All  computed  results  predict  approximately  the  same  trend,  with  the 
EAGLE  code  predicting  the  most  marked  drop  off  in  Cma.  Since  EAGLE  is  an  inviscid  code,  its 
failure  to  predict  the  reversal  of  the  stability  characteristics  indicates  that  the  reversal  is 
most  likely  driven  by  viscous-related  phenom^a. 

Computed  drag  results  appear  in  Figure  19.  As  with  the  pitdiing  moment  coefficient 
calculations,  the  trends  agree  with  the  experimental  results.  However,  in  all  cases,  the  code 
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significantly  under  predicts  the  measured  drag.  As  expected,  turbulent  drag  is  higher  than 
laminar  drag  due  to  the  much  higher  skin  friction  associated  with  the  turbulent  boundary 
layer. 

Examination  of  Figures  18  and  19  reveals  a  slight  change  in  the  computed  trends  in 
the  Mach  3.0-3.5  region.  Althou^  no  unique  flow  structures  were  isolated  to  this  Mach 
regime,  it  is  possible  that  a  more  sophisticated  fin  model  and  a  more  refined  grid  may  reveal 
such  structures  to  which  the  stability  reversal  may  be  attributed. 

The  results  presented  in  Figures  18  and  19  depend  heavily  on  three  primary  regions 
of  computation:  the  stagnation  region,  the  boundary  layer  region,  and  the  fin  region.  A  discus¬ 
sion  of  the  observed  results  for  each  of  these  areas  follows. 
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Figure  20:  Stagnation  Region  Pressure  Resuits  for  M»  b  3.0 


4.4.4  Stagnation  Region  Calculations 

Typical  results  for  stagnation  region  pressure  and  density  appear  in  Figures  20  and 
21,  respectively.  Both  figures  show  the  detached  nature  of  the  shodi  as  well  as  the  compres¬ 
sion  region  that  occurs  between  the  shodc  and  the  missile  nose.  A  detached  shock  is  expected 
since  the  missile  nose  is  slightly  rounded.  The  slight  smearing  of  the  shock  evident  in  the  fig¬ 
ures  and  is  a  result  of  the  coarseness  of  the  grid  as  well  as  the  Eurtifidal  dissipation  required  to 
prevent  the  formation  of  nonphysical  solutions  in  the  region.  Both  of  these  subjects  are  dis¬ 
cussed  in  Sections  4.5.  Since  the  nose  is  slightly  rounded,  a  normal  shock  develops  along  the 
stagnation  line,  and  the  Rankine-Hugonoit  relations  can  be  examined  to  ascertain  the  accu¬ 
racy  of  the  computed  solution.  Despite  the  slight  smearing  of  the  shock,  computed  values  of 
both  pressure  and  density  satisfy  the  jump  conditions  to  within  5%.  A  more  accurate  solution 
requires  a  higher  level  of  grid  refinement. 


43 


0.01  0.015  0.02  0.025  0.03  0.035  0.04  0.045 

x/D 


Rgur*  21:  Stagnation  Ragion  DanaKy  RaauKa  for  M.  s  3.0 


Accurate  computation  of  the  flow  in  the  stagnation  region  is  important,  since  this 
region  experiences  relatively  high  pressures  with  a  large  component  of  the  force  acting  in  an 
axial  direction.  Hus  subsequently  enters  into  the  drag  calculations.  Hence,  small  errors  in 
this  region  can  translate  to  measurable  differences  in  drag  results. 

4.4.5  Boundary  Layer  Computations 

Accurate  determination  of  the  skin  friction  requires  that  the  boundary  layer  be  ade¬ 
quately  resolved.  Figure  22  contains  velocity  vector  plots  of  a  small  portion  of  the  boundary 
layer  in  the  nose  region  of  the  missile.  Both  laminar  and  turbulent  boundary  layer  results  are 
presented.  The  length  of  each  vector  corresponds  to  the  magnitude  of  the  velocity  at  a  given 
point.  The  differences  in  the  laminar  and  turbulent  profiles  are  quite  evident  firom  the  figure. 
Of  special  importance  is  the  much  higher  wall  velocity  gradients  in  the  turbulent  case  versus 
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that  of  the  laminar  case.  This  results  in  a  higher  skin  friction  for  the  turbulent  case,  and  is 
precisely  the  reason  for  the  higher  drag  values  observed  for  turbuloit  versus  laminar  flow. 

Although  not  especially  important  in  this  context,  the  shodi  location  is  clearly  evident 
in  the  figure  as  the  flow  turns  into  itself.  The  TVD  scheme  is  able  to  capture  the  shock  in  this 
location  in  a  single  grid  point. 

While  Figure  22  depicts  the  wall  shear  stress,  the  relative  thicknesses  of  the  laminar 
and  turbulent  boundary  layers  are  shown  in  Figure  23.  The  figure  elucidates  the  Mach  con¬ 
tours  of  the  flow  in  the  fin  leading  edge  region.  The  fin  is  deleted  as  the  darkest  portion  of  the 
figure;  the  jagged  leading  edge  of  the  fin  is  a  result  of  the  technique  used  to  model  the  fin.  The 
turbulent  boundary  layer  is  approximately  three  times  thicker  than  the  laminar  boundary 
layer  at  this  location.  These  trends  are  in  agreement  with  theoretical  results. 

4.4.6  Fin-Region  Calculations 

Since  the  fins  provide  the  majority  of  the  pitching  moment  to  the  missile,  it  is 
extremely  important  to  the  accuracy  of  the  solution  that  the  flow  be  computed  correcciy  in 
that  region.  The  presence  of  the  fins  results  in  interesting  and  complicated  flow  structures  as 
are  depicted  in  Figure  24.  The  figure  presents  the  velocity  profiles  at  a  station  ei^t  grid 
points  downstream  from  the  leading  edge  of  the  fins  and  are  obtained  for  flight  at  5°  angle  of 
attack.  The  differences  in  the  flow  qualities  between  the  laminar  and  turbulent  boundary 
layer  computations  are  quite  obvious.  In  particular,  the  laminar  case  is  characterized  by  a 
region  of  flow  between  the  fins  which  possesses  a  relatively  high  downward  velocity  compo¬ 
nent.  This  region  is  restricted  to  an  area  close  to  the  missile  body.  In  contrast,  the  turbulent 
case  has  a  relatively  small  region  of  downward  velocity  whidi  is  localized  in  the  proximity  of 
the  upper  fin.  It  is  quite  possible  that  the  thideer  boundary  layer  associated  with  the  turbu¬ 
lent  flow  allows  for  the  bulk  of  the  flow  to  be  more  easily  entrained  by  the  fireestream.  This  is 
further  evident  from  an  examination  of  Figure  25  which  shows  the  laminar  case  at  a  higher 
Mach  number.  The  increased  energy  of  the  flow  reduces  significantly  the  downward  velocity 
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Figiir»23:  Mach  Contoura  for  Laminar  and  Tuibulofit  Doundaiy  Layara  in  Fin  Region 


component. 


Amommitiun  analysis  using  a  control  volume  approadi  would  indicate  that  a  larger 
downward  velocity  component  in  the  flow  exiting  the  volume  must  translate  to  an  upward 
force  generated  on  the  missile  body.  It  is  difficult,  however,  to  make  such  a  generalization 
since  the  overall  momentum  of  the  flow  cannot  be  ascertained  from  a  single  cut  throufd^  the 
computational  domain. 

Figures  26  examines  the  pressure  distribution  over  the  two  fins  and  depicts  the  same 
view  shown  in  Figure  24.  The  mixing  effect  of  the  turbulence  results  in  "spillage”  of  the  pres¬ 
sure  from  the  lower  to  the  upper  surfaces  ci the  fins.  Amore  equal  pressure  distribution  over 
the  fin  surfaces  corresponds  to  a  reduced  net  lift  and  subsequently  to  a  reduced  pitdiing 
moment 

4.5  Model  Weaknesaea 

In  reference  to  the  experimentally-observed  missile  stability  diaracteristics,  Vitale 
[44]  states,  "A  significant  number  of  flii^ts  were  conducted  in  this  Madi  range  (3.75-4.40)  and 
hence  there  is  a  high  confidence  level  in  the  experimental  data.”  Althou^  the  general  trmds 
in  stagnation  region  pressure  and  density,  boundsury  layer  thickness,  and  wall  shear  stress 
agree  with  theoretical  predictions,  the  discrepant  betwem  the  computed  and  experimentally 
observed  pitching  moment  derivative  and  drag  coefficients  suggests  a  weakness  in  modeling 
portions  of  the  geometries  and  physics  of  the  problem.  Consequently,  several  areas  require 
further  examination. 

4.5.1  Grid  Refinement 

Several  numerically  observed  phenomena  indicate  that  the  computational  grid  is 
inadequate  for  the  intricacies  and  complexities  of  the  problem.  The  inadequacy  of  the  grid 
manifests  itself  in  several  phenomena.  The  first  of  these  is  depicted  in  Figure  27  which  shows 
the  formation  of  a  nonphysical  solution  in  the  stagnation  region.  Instead  of  a  near-instanta- 
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Figure  26:  Fin  PraMuie  Dtatributions  for  Laminar  and  TUrbuiant  Boundary  Layara 
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Rgura  27: 0«v«loping  Non-physical  Solution 

neons  compression  through  the  shod:,  there  is  an  area  of  flow  expansion  as  indicated  by  the 
rapid  pressure  drop.  This  is  a  well-documented  occurrence  of  the  Roe  Scheme  [10,27],  and  is 
known  in  the  literature  as  a  oarhuncU.  It  is  a  nonphysical,  entropy-violating  expansion  shock 
which  occurs  in  the  vicinity  of  sonic  points  and  is  caused  by  a  nearly-zero  value  in  one  of  the 
system  eigenvalues  at  that  point.  The  phenomena  was  observed  in  several  of  the  solutions, 
and  was  only  eliminated  by  adding  artifidal  dissipation  into  the  numerical  scheme.  Since  dis¬ 
sipation  represents  an  irretrievable  loss  of  information,  its  excessive  use  is  undesirable. 
Refinement  of  the  grid  in  the  stagnation  region  represents  a  more  effident  method  of  mitigat¬ 
ing  the  problem. 

While  a  lack  of  grid  refinement  in  the  stagnation  region  can  help  to  fadlitate  the  for¬ 
mation  of  nonphysical  solutions,  grid  coarseness  in  the  fin  region  results  in  abnormal  pres- 
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Hgim  28;i  Abnormal  Prassitfo  SpHn  at  Rn  Loading  Edga 

sure  distributions  al(Kig  the  fins.  This  phenomena  is  depicted  in  Figure  28  vriiidi  shows  an 
abnormal  pressure  spike  at  the  leading  edge  of  the  missile  fin.  The  flow  structure  at  the  fin 
leading  edge  is  extremely  complicated  due  to  the  presence  of  shod/boundary  layer  interac¬ 
tion.  As  a  result,  the  grid  must  ha  sufficiently  refined  to  resolve  the  flow  structures  in  this 
region. 

Ihe  effect  of  insufficient  grid  refinement  in  the  boundary  layer  manifests  itself  in  the 
skin  friction  calculations.  Moran  [29]  has  examined  the  effects  of  wall  spacing  on  the  drag 
coefficient  calculations  and  has  found  that  decreasing  the  wall  spacing  from  0.001  to  0.000025 
results  in  variations  of  as  much  of  20%  in  computed  turbulent  drag  coeffidoit  values.  Lami¬ 
nar  drag  calculations,  however,  were  relatively  xmaffected  by  this  change.  Quite  obviously. 
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FHim  29:  Fin  Gaomaliy 


significantly  more  grid  points  are  required  in  order  to  accurately  model  the  problem. 


4.5.2  Fin  Model 

The  minority  of  the  pitdiing  moment  is  generated  by  the  missile  fins.  It  is  therefore 
very  important  that  the  fin  model  accurately  reflect  the  geometries  of  the  actual  missile.  As 
mentioned  previously,  an  infinitely-thin  fin  model  was  used  in  this  study.  Furthermore,  no 
conformal  gridding  was  used  for  the  fins;  they  were  simply  implemented  as  a  no-sUp  boundary 
condition.  Because  the  fins  are  implemented  as  boundary  conditions  which  are  enforced  at  cell 
centers,  the  fin  is  thus  restricted  to  only  "exist*  at  cell  centers.  A  finite  discretization  of  the 
physical  domain  results  in  a  fin  model  whidi  does  not  completely  reflect  the  physics  of  the 
actual  fins.  This  is  evident  in  Figure  29  whidi  depicts  the  Mach  contours  around  one  of  the 
missile  fins.  The  darkest  region  of  the  figure  r^resents  the  geometry  of  the  fin.  The  inaccura- 
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des  inherent  with  this  fin  model  are  evident  by  the  ragged  profile  of  the  fin  leading  edge.  This 
causes  perturbations  in  the  flow  field  which  are  not  reflected  in  the  idiysical  problem.  A  ccm- 
formal  gridding  scheme  would  eliminate  this  inaccuracy. 

A  second  source  of  inaccuracy  stems  from  the  infinitely-thin  fin  assumption.  Figure  30 
contains  a  series  of  images  depicting  cross-plane  velodties  at  various  positions  along  the  body. 
Careful  examination  of  Figure  30a  reveals  slight  perturbations  in  the  flow  upstream  of  the 
fins.  This  indicates  that  information  concerning  the  fins  is  propagated  upstream  throu^  the 
subsonic  boundary  layer.  Figure  30b  shows  the  flow  at  the  leading  edge  of  the  fins.  The  loca¬ 
tion  of  the  fins  is  clearly  visible  by  the  flow  deflection  near  the  missile  body.  Figure  30c  shows 
the  velodty  distribution  four  grid  points  aft  of  the  fin  leading  edge  middle  Figure  30d  d^icts 
the  flow  at  the  aft  aid  of  the  missile.  While  the  general  behavior  of  the  flow  is  as  expected,  the 
magnitudes  of  the  cross-plane  velocities  are  not  indicative  of  a  fin  of  finite  thickness.  Further¬ 
more,  since  there  is  no  area  on  the  leading  or  tailing  edges  of  the  fins,  there  can  be  no  pressure 
differential  across  these  surfaces  and  sulMequaitly  no  pressure  drag.  The  work  of  Laksh- 
manan  [23],  however,  indicates  that  there  is  indeed  a  pressure  differential  and  a  fairly  sub¬ 
stantial  pressure  drag  contribution.  This  source  of  inaccuracy  is  the  main  reason  the  drag 
numbers  reported  in  Figure  19  are  significantly  below  the  experimentally-obtained  values. 
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V.  Conclusions  and  Recommendations 


5.1  Summary  and  Investigation  Conclusions 

The  objectives  of  this  investigation  were 

•  to  develop  an  algorithm  capable  of  decomposing  the  computational  domain  into  a 
series  of  si^omains  or  bloc^  such  that  calculations  could  be  conducted  in  parallel 
over  the  blodcs,  and 

•  to  use  the  code  to  study  the  aerodynamic  characteristics  of  a  sharp-nosed,  high¬ 
speed  missile  configuration. 


Both  objectives  were  met.  In  terms  of  the  first  objective,  PANS-3EM,  a  block-struc¬ 
tured  computer  algorithm  for  the  study  of  hi^-speed  missile  configurations  was  successfully 
developed  and  tested.  The  algorithm  incorporates  a  block-table  data  structure  which  facili¬ 
tates  the  management  of  the  computational  domains  with  very  little  memory  penalty.  This  is 
extremely  important  given  the  memory  limitations  on  many  of  the  available  computing  plat¬ 
forms. 

In  its  current  form,  the  algorithm  suffers  performance  penalties  associated  with  the 
loop-level  parallelism  that  was  carried  forward  from  the  structure  of  the  serial  code.  With  fur¬ 
ther  modification  coupled  with  the  use  of  operating-system-level  parallelization  routines,  a 
more  coarse-grained  parallel  approach  can  be  taken  to  yield  significant  performance  improve¬ 
ments.  With  the  domain-decomposition  routines  in  place,  the  code  provides  the  flexibility  for  a 
subsequent  modification  to  either  shared-  or  distributed-memory  computing  platform. 

With  the  first  objective  successfully  completed,  the  aerodynamic  characteristics  of  a 
sharp-nosed,  high-speed  missile  configuration  were  investigated.  Nose  pressures  and  densi¬ 
ties  in  the  stagnation  region  accurately  satisfied  the  Rankine-Hugonoit  jump  conditions.  How¬ 
ever,  several  solutions  were  plagued  with  the  presence  of  a  nonphysical  expansion  region  in 
the  vicinity  of  the  sonic  point  which  affected  flow  field  and  pressure  distribution  in  the  imme¬ 
diate  vicinity  of  the  stagnation  point.  The  area  over  which  this  region  affected  the  pressure 
distribution  on  the  missile  was  not  ascertained,  and  further  examination  of  this  phenomena  is 


warranted. 


Particular  attention  was  devoted  to  the  behavior  of  the  pitching  moment  derivative  as 
a  function  of  increasing  Mach  number.  The  reversal  in  the  stability  trend  for  increasing  Mach 
number  as  obtained  from  experimental  data  was  not  verified  by  the  computer  code.  Althouid^ 
a  grid'sensitivity  study  was  not  conducted  due  to  computing  resource  limitations,  it  is 
believed  that  the  current  grid  does  not  possess  the  level  of  refinement  necessary  to  capture  all 
of  the  important  aspects  of  the  flow  structure,  especially  in  the  stagnation  region  and  the  area 
of  the  fins.  This  was  manifest  in  several  phenomena  including  a  20%  change  in  the  computed 
turbulent  drag  coefficient  as  the  wall  spacing  was  refined  from  .001  to  .000025.  Additionally 
abnormal  pressure  spikes  at  the  leading  edge  of  the  fins  observed  for  the  laminar  boundary 
layer  solutions.  These  spikes  were  not  seen  in  the  turbulent  case,  and  it  is  believed  that  they 
would  be  eliminated  given  a  sufficient  level  of  grid  refinement. 

5.2  Suggested  Areas  for  Further  Study 
5.2.1  Parallel  Code  Development 

1)  The  domain-decomposition  of  PANS-3EM  should  be  further  explored  so  that  the 
computational  domains  can  be  completely  decoupled.  This  decoupling  would  facilitate  a 
coarse-grained  parallel  approach  which  has  the  potential  of  much  greater  levels  of  perfor¬ 
mance  and  is  a  required  step  for  implementation  on  a  distributed-memory  computing  plat¬ 
form. 

2)  A  comparative  study  of  the  use  of  buffer  arrays  to  manage  communication  informa¬ 
tion  vs.  the  blodc  table  would  be  beneficial  in  ascertaining  the  best  use  of  computer  resources 
in  terms  of  memory/speed  trade-offs.  Since  buffer  arrays  were  not  implemented  in  this  inves¬ 
tigation,  no  direct  performance  comparisons  were  possible. 

3)  The  current  code  uses  very  large  arrays  to  contain  the  flow  variable  and  grid  met¬ 
ric  information.  New,  very-high  performance,  workstation-type  computers  depend  heavily  on 
memory  caching  in  order  to  achieve  maximum  performance.  The  large  data  structures  of  the 
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current  code  do  not  map  well  to  such  a  caching  architecture,  and  thus  it  is  difficult  to  achieve 
good  computing  performance.  Taking  advantage  of  the  explicit  nature  of  the  TVD  algorithm, 
it  is  possible  to  restructure  the  code  such  that  computations  are  performed  in  a  more  pipe¬ 
lined  fashion  whereby  all  information  for  a  given  node  at  time  level  n+1  is  computed  before 
computations  on  the  next  node  begins.  The  inherent  parallelism  of  the  algorithm  is  unaffected 
by  such  a  modification. 

5.2.2  Aerodynamic  Issues 

1)  Given  the  sensitivity  of  the  computed  flow  field  on  the  calculation  of  the  pitdiing 
moment  coefficient,  it  may  prove  beneficial  to  explore  the  effect  of  various  gridding  techniques 
for  the  fins.  Instead  of  treating  the  fins  as  an  impermeability  boundary  condition  enforced  at 
certain  computational  mesh  points,  a  conformal  gridding  approach  may  improve  the  accuracy 
of  the  calculations  in  the  fin  region.  A  finite-thickness  fin  model  may  also  provide  insist. 
Regardless  of  the  method  of  fin  implementation,  a  finer  grid~both  axially,  radially,  and  azi- 
muthally  should  be  used. 

2)  Since  boundary  layer  growth  rates  are  different  for  laminar  and  turbulent  cases,  a 
study  should  be  conducted  concerning  the  effectiveness  of  the  missile  fins  as  a  function  of 
boundary  layer  thickness.  This  is  especially  important  given  the  boundary  layer/shock  inter¬ 
action  phenomena  that  occurs  at  the  leading  edge  of  the  fins. 

3)  The  Baldwin-Lomax  turbulence  model  used  in  this  study  is  relatively  easy  to 
implement  and  has  been  shown  to  give  acceptable  results  for  fin/body  junction  calculations 
[23].  However,  implementation  of  a  two-equation  model  would  provide  a  means  of  comparison 
of  the  performance  of  the  turbulence  model  and  its  affect  on  the  solution.  A  different  model 
may  provide  the  subtle  difference  necessary  to  verify  the  experimental  data,  or  at  the  very 
least,  it  would  provide  a  verification  of  the  turbulence  model  currently  in  use. 
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Appendix  A:  Nondimensionalization  of  Governing  Equations 


Nondimensionalization  of  the  governing  equation  follows  the  procedure  of  Beran  [4] 
and  is  accompUshed  through  use  of  the  following  parameters 


X  =  x*D 

y  =  y*D  z  = 

z*D 

u  =  u*U 

V  =  v*U  w  = 

■■  w*U 

P  =  P*P,,;. 

•  =  <•  (5) 

p  =  P-P„/ 

E  =  E* 

where  the  nondimensional  parameters  are  denoted  by  a  superscript  D  is  the  diameter  of 
the  missile  body,  and  the  subscript  rs/ refers  to  a  reference  value. 

Substituting  these  values  into  the  governing  equations  }rields  the  following  set  of  non^ 
dimensional  equations. 

Continuity 


^(P*Pref  d(P*Pre/*^  d(p*p^^^*U) 

d(t*  (D/U))  d(x*D)  d(y*D)  d(z*D) 


Pulling  constant  terms  outside  the  differentials  and  multiplying  the  equation  by  (D/U) /p^gf 
gives 


d(p*)  3(P*«*)  3(p*v*)  a(p*w'*) 

d(x*)  a(y*)  a(z*) 


(A2) 


X-momentum 

^iP*Pref*^  .  ,  9(P*Pr*/‘*^*^^>  9(P* 

h{t*{D/U))  a(ac*D)  d(y*D)  d(z*D) 
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d(x*D)  d{x*D)  4^  dy*D^^ 


(A3) 


^  diy*D)  ^ref%*D  dx*D^^  d(z*D)  ^rephz*D  ~  Sx*D^' 


where  the  viscous  stress  terms  have  been  expanded  into  their  velocity-derivative  components. 
Again  pulling  constant  terms  out  of  the  differaatials  and  multiplying  the  resulting  expression 
by  L/  (P„^^)  I  equation  A.3  becomes 


d(p*u*)  d(p*u*^)  3(p*«*v*)  3(p*h*w*)  d(p*) 


dt* 


dx* 


ay" 


dz* 


dx* 


1  .3 


au*  av*. 


^ax*  ^3^*  ;)v* 


dx*  dy*  dy" 


aa* 

dy* 


dv*  d  du*  dw*..  fAA\ 

ax*  ^  az*  dz*  ~  ax*  ^  ^ 


where  Re  is  the  Reynolds  number  based  on  missile  body  diameter. 

A  similar  process  yields  expressions  for  they-  and  ^-momentum  equations. 


Energy 


For  the  sake  of  brevity,  the  derivation  of  the  energy  equation  is  presented  somewhat 
more  succinctly.  Constant  terms  are  pulled  outside  the  derivatives  and  terms  involving  the 
viscous  stress  components  are  not  placed  in  terms  of  velocity  derivatives.  This  results  in 


p,^//Va£,*  a(£,*«*)  a(£,*v*)  a(£,*w*) 

~D^  I  " 


at* 


ax* 


dy* 


dz* 


^refref 


dT*  ^  dT*  ^  dT*  \ 
+ - -  + 


ax* 


dy* 


dz* 


V 


((T*  +X*  +T*  )m*+(x*  +x*  +x*  )v*  +  (x*  +X*  +X*  Iw*) 

"•  XX  yx  zx’  yx  yy  zy  zx  zy  zz 
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Pre/'  ^d(p*u*)  d(p*v*)  d(p*w*)  ^ 

~D~~  ^  dy*  ■*■  ~1i*  ^ 


After  multiplying  equation  A5  by  D/  >  the  coefficient  of  the 


(A5) 


viscous  terms  is  given  as 


»*r<r  1 

Pr,yUD  '  Re 

and  the  coefficient  of  the  heat-flux  terms  is  found  to  be 

Wref 


(A.6) 


(A.7) 


Applying  the  definition  of  Mach  number  and  Prandtl  number  to  the  terms  of  equation  A.7 
yields  the  relation 


\eFref  ^  1 

■  RePriy-\)Ml^ 

Finally,  from  Beran  [4], 


(A8) 


K*  =  H* 


(A.9) 


Thus,  the  energy  equation  becomes 

dE*  diE*u*)  diE*v*)  d(E*w*) 
dt*  ^  3bc*  ^  dy*  ^  dz* 


1 


I'  BT*  BT*  ^  BT*  \ 

3(I*>3jj) 


RePr{y-  l)M 


ref 


Bx* 


By* 


Bz* 


—  ((X* 


+x* 


yx 


+  x*  )u*  +  {x* 
zx'  '  yx 


+  X* 

yy 


+  ^%)v*  +  (x*^ 


+x* 

zy 


+  x*pw*) 


B(p*u*)  B(p*v*)  B(p*w*)^ 


rA  in. 


Appendix  B:  Parallelization  Issues 


Tliis  appendix  discusaea  the  iaauea  which  muat  be  addreaaed  during  paralldization  of 
a  CFD  code.  It  ia  provided  aa  a  road-nu^  for  future  parallelization  eilbrta.  General  iaauea 
including  handling  of  boundary  conditiona  and  aynchronization  iaauea  are  preaented  along 
with  apedfica  of  the  implementation  of  the  parallel  TVD  acheme  uaed  in  thia  project 

B.l  Boundary  Conditions 

Boundariea  for  the  blo<^ed  domaina  can  be  divided  into  two  groupa:  one  group  con- 
taina  thoae  boundariea  whidi  repreaent  phyaical  boundariea  and  are  treated  via  the  mathe¬ 
matical  boundary  conditiona  impoaed  by  the  problem  phyaica,  while  the  aecond  group  containa 
thoae  boundariea  whidi  are  introduced  in  the  blocking  proceaa.  Unlike  the  boundariea  in  the 
firat  group,  theae  boundariea  have  no  phyaical  baaia  and  conaequently,  their  preaence  muat 
not  affect  the  solution  in  any  way.  Figure  B.l  depicta  theae  two  types  of  boundaries.  In  the  fig¬ 
ure,  the  continuous  computational  domain  has  been  cut  into  two  blodcs  and  the  blocks  have 
been  rolled  badi  to  reveal  the  newly-created  boundaries  represented  by  the  shaded  surface  on 
each  block. 

The  domain  decomposition  does  not  alter  the  method  by  which  the  physically  baaed 
boundary  conditions  are  handled.  However,  since  the  boundary  conditions  associated  with 
each  face  of  the  domain  can  be  unique,  this  poses  a  problem  for  homogeneous  parallel  pro¬ 
gramming  in  which  all  threads  of  execution  carry  out  the  same  set  of  instructions.  For  exam¬ 
ple,  at  the  leading  edge  of  the  computational  domain  in  this  implementation,  ^ost  points  are 
reflected  across  the  stagnation  line  in  order  to  provide  suitable  handling  of  the  boundary  con¬ 
ditions  and  the  required  number  of  points  for  second-order  accuracy  of  the  finite-difference  or 
finite-volume  scheme.  Atypical  code  fragment  which  handles  the  boundary  condition  updates 
for  these  points  follows.  For  this  code  fragment  as  well  as  all  following  fragments,  the  follow¬ 
ing  convention  is  used: 

•  The  code  fragments  appear  in  FORTRAN. 
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Itowty 


iMHpliysioAl  bMHHiflftoS 


FlguraB.1:  BlockiKi  Domain  Boundary  lypM 


•  Most  fragments  contains  a  serial  code  section  and  its  corresponding  PANS-3EM 
code  section 

•  Sections  of  code  of  particular  importance  to  the  blocking  process  i^ppear  under¬ 
lined. 

•  Lines  preceded  by  a  “/*”  designate  comment  lines. 

•  The  variable  name  numblk  refers  to  the  number  of  blodcs  in  the  blodc-structured 
computational  domain. 

•  Look-ups  to  the  blodc  table  are  denoted  as  iblktb. 

Code  Fragment  1:  Boundary  Condition  Treatment 


Serial  Code 

do  10  j  =  1,  jmax 
do  10  k  =  2,  kinax-2 

rho(0,j,k)  =  6.*rho(2, j,k)-8.»rho(3, j,k)+3.*rho(4, j,k) 
u(0,j,k)  =  6 .•u(2, j ,k) -8 .•u(3, j,k) +3 .*u{4, j ,k) 

v{0,j,k)  =  6.*v(2, j,k)-8.*v(3, j,k)+3.*v(4, j,k) 

w(0,j,k)  =  6 . *w{2, j ,k) -8 . *w(3, j ,k) +3 . *w(4, j ,k) 

p(0,j,k)  =  6.*p(2, j,k)-8.*p(3, j,k)+3.*p(4, j,k) 

et(0,j,k)  =  p(0, j,k) /gml  +  .5*rho(0, j,k) • 

(u(0, j ,k) **2+v(0, j,k) •*2+w(0, j ,k) **2) 

vu  (0, j ,k,n, 1)  =  rho(0,j,k) 
vu (0, j ,k,n, 2)  =  rho (0, j , k) *u (0, j ,k) 
vu (0, j ,k,n, 3 )  =  rho(0, j ,k) ‘vio, j ,k) 
vu(0, j ,k,n,4)  =  rho(0,3,k) *w(0, j.k) 
vu(0, j,k,n,5)  =  et(0,j,k) 

10  continue 


64 


do  10  L  =  1.  numbik 

call  indxcp(L, 0,0,1, jmax,kinin,kmax, iblkmn, iblkmx, 
jblkmn,  jblkmx,  kbllcinn.kbllanx) 
do  10  i  =  iblkmn.  iblkmx 
do  10  j  =  1,  jmax 
do  10  k  =  1,  kmax 

rho  (i,j,k,L)  =  6.  *rho  (i+2 ,  j  ,  k,L)  -  8  .  ‘rho  li+3,j,k,L)+3.  *rho  ( i->-4 ,  j ,  k,  L) 
u{i,j,k,L)  =  6  .  *u{i+2,  j  ,k,  L) -8  .  *u  (  i  +  3,  j  ,  k,  L)  »3  . ‘u  (  i-f4  ,  j  ,  k,  L) 
v(i,j,k,L)  =  6 . *v(i+2, j , k, L) -8 . •v( i  +  3, j , k, L) +3 . *v( i  +  4 , j , k,  L) 
w(i,j,k,L)  =  6 . ‘wl i+2, j , k,L) -8 . *w( i+3, j , k, L) +3 . *w( i+4, j , k, L) 
p(i,j,k,L)  =  6 . ’p {i+2, j , k, L) -8 . *p (i+3 , j , k, L) +3 . ’p ( i+4 , j , k, L) 
et(i,j,k,L)  =  p (i, j , k,L) /gml  +  . 5*rho ( i, j , k, L) * 

(u(i,j,k,L)**2+v(i,j,k,L)**2+w(i,j,k,L)**2) 

vu (i, j , k, L, n, 1)  =  rho(i,j,k,L) 
vu  (i,  j  ,  k,L,  n,  2)  =  rho  (i,  j ,  k,  L) ‘u  (i,  j  ,  k,  L) 
vu ( i, j ,k, L,n, 3 )  =  rho(i, j , k, L) *v(i, j , k, L) 
vu (i, j ,k,L,n, 4)  =  rho ( i, j , k, L) ‘wii, j , k,L) 
vu(i, j ,k,L,n,5)  =  et(i,j,k,L) 

10  continue 


Although  the  restructured  code  allows  all  threads  of  execution  to  execute  the  loop, 
those  blocks  not  containing  the  ^ost  points  will  merely  continue  since  the  iblkmn  and  iblkmx 
loop  indices  returned  by  the  call  to  indxcp  will  be  0  and  -1^.  Depending  on  the  parallel  imple- 
moitation  and  synchronization  method  used,  execution  threads  which  handle  blodcs  not 
involved  in  this  loop  may  be  able  to  continue  on  to  the  next  code  segment.  If  loop-level  paral¬ 
lelism  is  used,  or  if  the  code  requires  syndhronization  before  the  next  code  segment  can  exe¬ 
cute,  then  threads  will  be  forced  to  wait  or  spin  until  the  thread  handling  the  ghost  point 
completes  execution  of  the  loop.  This  is  a  potential  source  of  inefficiency  whidi  can  greatly 
limit  the  performance  gains  of  parallelism.  The  same  problem  holds  true  for  the  remaining 
boundaries  of  the  computational  domains  and  extends  to  any  section  ai  code  which  handles 
only  a  portion  of  the  computational  domain  sudi  as  those  loops  which  only  operate  over  the 
missile  fins.  Future  implementations  should  investigate  a  more  complete  decoupling  of  the 
domains.  Several  options  are  available,  inclxiding  an  object-oriented  programming  approach 
in  which  the  domains  are  constructed  as  classes.  Boundary  condition  updates  can  be  per¬ 
formed  through  a  class  method  which  can  be  implemented  such  that  all  blocks  update  bound¬ 
ary  information  simultaneously. 

The  reader  should  note  that  the  i*2,  i+3,  and  i+4  array  element  refermces  appearing 
in  the  above  code  fragment  are  not  the  correct  way  to  handle  the  i=2,  i^,  and  i=4  references 


1.  A  discussion  of  the  subroutine  indxcp  and  its  associated  block  table  appears  in  secticm  B.2. 


of  the  original  code  segmoit  and  only  appear  as  such  here  to  avoid  complicatiMi  of  the  lo<^  for 
this  discussion.  Adiscussion  of  the  proper  handling  of  these  references  appears  in  Sections 
B.3andB.4. 


B.2  Inter-Domain  Communication 

All  across-block  references  can  be  thou^t  of  as  inter-domain  communication.  Several 
methods  were  investigated  for  handling  these  references.  In  the  current  implementation,  any 
i-1  or  i+1  reference  in  the  code  constitutes  a  possible  across-blodc  reference.  Althou^  not 
implemmted  in  the  current  version  of  the  computer  code,  k-1,  or  A-fi  reference  also 

constitutes  a  potential  across-block  reference  in  a  completely  generic  domain-blodung 
scheme. 

Because  tha  use  of  buffer  arrays  was  determined  to  be  in^ractical  due  to  memory  con¬ 
straints,  a  blodc  table  data  structure  was  developed  to  manage  the  flow  of  information 
between  the  blodcs  of  the  computational  domain.  The  block  table  was  implemented  as  a  sim¬ 
ple  two-dimensional  array.  Figure  A.2  shows  a  sample  blodc  table  for  a  fully  generic  domain 
decomposition  case  while  Figure  A.3  shows  the  block  table  and  domain  decompositiaa  used  in 
this  pr(gect.  The  boldly  outlined  cells  in  the  blo<^  tables  represent  that  portion  of  the  block 
table  actually  implemented  in  the  data  strucbue.  Althou^  the  block  table  serves  a  key  pur¬ 
pose  in  the  domain-decomposition  process,  blocking  the  domain  by  simply  using  the  block 
table  to  index  the  program  loops  becomes  very  convoluted.  This  is  illustrated  in  code  fragment 
2  which  is  taken  from  the  subroutine  which  computes  the  velocity  and  temperature  gradient 
terms  (subroutine  GRADIENT). 


Code  Fragment  2:  Block  Table  Usage 


Serial  Code 

do  25  k  =  1,  kmax-l 
do  25  j  =  2,  jjnax-1 
do  25  i  =  2,  imax-l 

dudx(i, j ,k) = . 5* ( . 5* (xnip (i, j ,k) +xnip(i-l, j ,k) ) * 

{u(i+l,j,k)-u(i-l, j,k) )  +  .5*{xejp(i, j,k)+xejp(i, j-l,k) ) * 
(u (i, j+l,k) -u(i, j-l,k) )  +  .5*(xzkp(i, j,k)+xzkp(i, j,k-l) ) * 
(u(i, j,k+l)-u(i, j,k-l) ) ) 

25  continue 


66 


max  coordinates  given  for  a  61  x  81  x  35  computational  grid 


Figure  B^:  Generic  block  etrueture  and  aseociated  block  table 
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Block 

Neighboring  Block 

Max  Coordinate 

Num 

imin 

imax 

jmin 

jmax 

kmin 

kmax 

i 

1 

j 

k 

1 

0 

2 

0 

0 

0 

0 

15 

81 

35 

2 

1 

3 

0 

0 

0 

0 

31 

81 

35 

3 

2 

4 

0 

0 

0 

0 

47 

81 

35 

4 

3 

0 

0 

0 

0 

0 

61 

81 

35 

FiguraB^:  Prejact  block  stnictui*  and  asaocialad  block  tabl* 


Using  the  block  table  to  reconstruct  this  loop  requires  special  attention  to  the  i~l  and 
i+1  across-blodc  references.  A  total  of  four  loops  are  thus  required:  two  loops  to  handle  the 
interior  points  for  the  blod:s  and  two  loops  to  handle  the  across-block  references.  The  code  is 
then  restructured  as  follows. 

FANS-3EMCQde 

/♦Handle  the  first  block  except  across-block  references. 


C8 


L  T  3. 

do  10  k  =  1,  kmax-l 
do  10  j  =  2,  jmax-1 
do  10  i  =  2.  lblktb(L.7) -1 

dudx(i,  j  ,k,L)  =  .5*(.5'*  (xnip  (i,  j ,  k,  L)  +xnip  (i-1,  j ,  k,  L) )  * 

(u(i  +  l,j,k,L)-u{i-l,j-k,L))  +  .5*  (xejp  (i,  j  ,k,L)  i-xe  jp  ( i ,  j -1 ,  k,  L)  )  * 
(u{i,j+l,k,L)-u(i,j-l,k,L))  +  .5*(xzkp(i,j,k,L) +xzkp (i, j ,k-l,L) ) * 
(u(i, j ,k+l,L) -u !i, j ,k-l,L) ) ) 

10  continue 

/•Handle  the  remaining  blocks,  but  avoid  across-block  references. 

do  20  L  =  2.  numblk 
do  20  k  =  1,  kmax-l 
do  20  j  =  2,  jmax-1 
do  20  i  =  1.  iblktb(L.7l -1 

dudx(i, j ,k,L) = . 5» ( . 5* (xnip (i, j ,k,L) +xnip (i-l , j , k, L) ) * 

(u(i+l,j,k,L)-u{i-l,j,k,L))  +  . 5* (xe jp (i , j , k, L) +xe jp ( i , j -1 , k, L) ) * 
(u (i, j+l,k,L) -u(i, j-l,k,L) )  +  .5* (xzkp (i, j ,k,L) +xzkp (i, j ,k-l,L) ) * 
(u(i,  j,ki-l,L)-u(i,  j,k-l,L) ) ) 

20  continue 

/•Handle  the  i+1  across-block  references. 

do  30  L  =  1.  numblk-1 
do  30  k  =  1,  kmax-l 
do  30  j  =  2,  jmax-1 
i  =  iblktb(L.7) 

dudx(i, j ,k,L)=.5^{.5^ (xnip (i, j ,k,L) +xnip (i-l, j , k,L) ) • 
(u(l.i.k.iblktb(iblktb(L.2l ) -u( i-l . i . k . Ll  1  + 

.5^ (xejp(i, j ,k,L) +xejp(i, j-l,k,L) ) • 

(u(i,j+l,k,L)-u(i,j-l,k,L))  +  .5^ (xzkp (i, j , k,L) +xzkp ( i , j ,k-l,Ly ) • 
(u(i, j ,k+l,L) -u (i, j, k-l,L) ) ) 

30  continue 

/•Handle  the  i-l  across-block  references 

do  40  L  =  2.  numblk 
do  40  k  =  1,  kmax-l 
do  40  j  =  2,  jmax-1 
i  =  1 

dudx(i, j ,k,L)=.5* ( .5^ (xnip(i, j ,k,L)+ 

xiiiD(iblktb(iblktb(L.7l 1 .i.k.iblktb(L,l) ) • 
(u(i»l.i,k,L)-u(iblktb(iblktb(L.71 ) . i . k. iblktb (L. 1 ) 1  f 
.5^(xejp(i, j,k,L)+xejp(i, j-l,k,L) ) * 

(u(i,j+l,k,L)-u(i,j-l,k,L))  +  .5* (xzkp (i , j ,k,L) +xzkp (i, j ,k-l,L) ) • 
(u(i, j,k+l,L)-u(i, j,k-l,L) ) ) 

40  continue 


The  methodology  used  in  code  fragment  2  is  readily  adaptable  to  a  distributed-mem¬ 
ory  architecture  since  the  indices  which  contain  block  table  references  would  be  implemented 
via  message  passing  on  a  distributed-memory  madiine.  Using  the  Intel  Hypercube  message¬ 
passing  paradigm  as  an  example  [21],  a  message  consists  of  five  parts  which  include  the  mes¬ 
sage  data  type,  message  buffer,  message  length,  receiving  node  identifier,  and  receiving  pro¬ 
cess  identifier.  Examining  the  underlined  variable  xnip,  the  message  data  tjrpe  is  the  type  of 
the  variable  xnip  (double-precision  real),  the  buffer  corresponds  to  the  particular  array  ele¬ 
ment  of  xnip,  the  message  length  is  8  bytes  for  a  double-precision  array  element,  and  the 
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receiving  node  identifier  is  denoted  in  iblktb(L,l).  The  receiving  process  identifier  depends  on 
the  implementation. 

Tlxe  utility  of  the  buffer  arrays  mentioned  in  Chapter  3  is  also  obvious  in  code  firag- 
mmit  2.  The  existence  of  buffer  arrays  eliminates  the  need  for  special  handling  of  the  across- 
block  references  in  the  third  and  fourth  loops.  The  arrays  would  simply  be  updated  once  at  the 
beginning  of  each  sweep  and  the  indexing  on  the  i  variable  incremented  over  the  block  bound¬ 
aries.  Again,  while  this  method  is  very  efficient  in  terms  of  execution  speed  (once  the  arrays 
are  updated),  the  memory  penalty  can  be  prohibitive. 

Ihe  implementation  shown  in  code  fragment  2  does  not  increase  the  order  of  complex¬ 
ity  of  the  computer  code;  it  is  still  0(n)  where  n  is  the  total  number  of  grid  points  in  the  com¬ 
putational  mesh.  In  fact,  the  only  overhead  incurred  m  this  implementation  is  that  associated 
with  setting  up  the  additional  looping  structures  and  with  the  double  lookup  of  array  indices 
given  by  ihlktb(.iblktb(.L,7)).  Des'  te  this  fact,  the  code  is  fraught  with  several  weaknesses. 
They  include: 

•  The  code  relies  on  the  fact  that  the  first  block  along  the  i  direction  is  block  1  and 
the  last  block  in  the  i  direction  contains  the  last  i  point.  This  may  not  be  the  case  for  a 
completely  generic  implementation. 

•  The  volume  of  additional  code  generated  makes  the  introduction  of  mistakes  very 
likely. 

*  Eadi  loop  handles  only  a  portion  of  the  blocks,  thus  making  an  efficient  parallel 
implemoatation  very  unlikely. 

*  The  code  structure  does  not  handle  computational-to-block  coordinate  transforma¬ 
tions  well. 

While  the  first  three  weakness  are  problematic,  the  fourth  makes  the  approach  shown 
in  code  fragment  2  unusable. 

B.3  Computational  and  Block  Coordinates 

Once  the  computational  domain  has  been  decomposed,  the  ordered  triple  locating  a 
variable  in  the  computational  domain  cannot  be  used  to  locate  that  variable  in  the  block  coor¬ 
dinate  space.  This  poses  a  challmiging  problem  in  determining  proper  loop  indices  for  the  pro¬ 
gram  loop  constructs.  An  example  involving  the  missile  fins  amply  illustrates  this  point. 
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Min  mn2 

Figura  B.3:  ComputationaMo^ioek  Cootdinata  Transformation 

Calculation  of  the  gradients  at  the  missile  fins  require  special  handling,  since  the  fins 
are  implemented  as  a  impermeability  boundary  condition  in  the  computer  code.  Consequently, 
there  are  code  loops  with  indices  of  ifin,  ifin-1,  ifin2-2,  etc.  While  it  is  very  straightfor¬ 
ward  to  determine  the  location  of  ifin2-l  given  the  location  ifin2  in  the  computational  coordi¬ 
nate  space,  it  is  much  less  a  trivial  problem  in  the  block  coordinate  space.  Although  the  fin 
configuration  in  the  domain  decomposition  used  in  this  project  did  not  pose  a  great  problem,  it 
is  possible  that  a  different  decomposition  could  yield  the  configuration  depicted  in  Figure  B.3. 
Because  the  leading  edge  of  the  fin  (ifin)  appears  on  a  block  boundary,  ifin-1  is  not  found  sim¬ 
ply  by  looking  one  location  to  the  left;  instead,  it  appears  in  a  different  blodc.  A  loop  construct 
with  indices  of  ifin-1  and  ifin2  would  thus  require  careful  handling.  The  following  code  frag¬ 
ment  loops  over  the  region  of  the  physical  domain  from  the  leading  edge  of  the  missile  fins  to 
one  point  before  the  trailing  edge  of  the  missile  fins. 
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Code  Fragment  3:  Loop  Index  Example 


Serial  Code 

do  310  i  =  ifin.  ifin2-l 
do  310  i  =  ifindUl.  imax-1 

signx  =  sign (1 . 00, alpha (i, j , 1, 2) ) 

dargl  =  min (abs (alpha ( i, j , 1, 2) ) , signx*alpha ( i, j , 1, 1 ) ) 
g(i,j,l,2)  =  signx  *  max (0 .00, dargl) 

signx  =  sign (1 . 00 , alpha ( i, j , 2, 2 ) ) 

dargl  =  min(abs (alpha {i, j , 2, 2) ) , signx*alpha ( i, j , 2 , 1) ) 
g(i,j,2,2)  =  signx  *  max (0 .00, dargl) 

signx  =  sign ( 1 . 00 , alpha ( i, j , 3 , 2 ) ) 

dargl  =  min(abs (alpha ( i, j , 3, 2) ) , signx’alpha ( i, j , 3 , 1) ) 
g(i,j,3,2)  =  signx  *  max(0. 00, dargl) 

signx  =  sign (1 . 00, alpha ( i, j , 4 , 2 ) ) 

dargl  =  min(abs (alpha (i, j , 4, 2) ) , signx*alpha (i, j , 4, 1) ) 
g(i,j,4,2)  =  signx  *  max (0.00, dargl) 

signx  =  sign ( 1 . 00 , alpha ( i, j , 5, 2 ) ) 

dargl  =  min(abs (alpha ( i, j , 5, 2) ) , signx’alpha ( i, j , 5, 1) ) 
g(i,j,5,2)  =  signx  *  max (0.00, dargl) 

310  continue 


If  it  is  assumed  that  the  block  decomposition  varies  depending  on  the  number  of  grid 
point  and  the  number  of  available  processors,  then  it  cannot  be  known  a  priori  in  which  block 
the  leading  and  trailing  edge  of  the  fins  will  exist  Furthermore,  any  additional  geometries  or 
features  to  the  missile  which  require  special  numerical  computations  will  greatly  exacerbate 
this  problem.  Code  fragment  2  required  special  loops  to  handle  across-block  references,  yet  it 
is  obvious  that  it  is  not  possible  to  unequivocally  state  whether  or  not  across-boundary  refer¬ 
ences  are  required  in  a  looping  construct  between  two  arbitrary  i  coordinates.  This  makes  the 
code  listed  in  code  fragment  2  highly  specific  and  impractical  to  generalize  for  handling  arbi¬ 
trary  geometry  or  fiow-field  specific  calculations. 


B.4  Final  Implementation 

The  coordinate  transformation  and  across-block-reference  problems  were  solved  via 
the  implementation  of  a  subroutine  which  takes  as  an  input  a  computational  coordinate  and 
computes  proper  loop  indices  in  block-coordinate  space.  With  this  implementation,  code  frag¬ 
ment  3  becomes 
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PANS-3EMCQde 

do  10  L  =  1.  numblk 

call  lndxcD(L. ifln.ifin2-l. jmin. jmax, kmin. kmax. iblkmn. ibllonx. 

jblkmn, jblkmx, kblkmn, kblkmx) 
do  10  i  =  iblkmn.  iblkmx 

do  10  ,1.=  1£in(L,iitl,  jmax-l 

signx  =  sign (1 . 00, alpha ( i, j , L, 1, 2) ) 

dargl  =  min(abs (alpha (i, j , L, 1, 2 ) ) , signx*alpha ( i, j , L, 1 , 1) ) 
g(i,j,L,l,2)  =  signx  •  max (0 . 00, dargl) 

signx  =  sign(l . OO.alpha (i, j , L, 2, 2) ) 

dargl  =  inin(abs (alpha (i, j , L, 2, 2) ) , signx*alpha ( i, j , L, 2, 1) ) 
g\i,j,L,2,2)  =  signx  *  max (0.00, dargl) 

signx  =  sign (1 .00, alpha (i, j , L, 3, 2) ) 

dargl  =  min(abs (alpha (i, j ,L, 3, 2) ) , signx*alpha ( i, j , L, 3, 1) ) 
g(i,j,L,3,2)  =  signx  •  max (0.00, dargl) 

signx  =  sign(1.00,alpha(i, j,L,4,2) ) 

dargl  =  min(abs (alphad,  j , L, 4, 2) ) ,  signx'alpha ( i,  j , L,  4, 1) ) 
g(i,j,L,4,2)  =  signx  •  max (0 .00, dargl) 

signx  =  sign (1 .00, alpha (i, j , L, 5, 2) ) 

dargl  =  min (abs (alpha ( i, j , L, 5 , 2) ) , signx*alpha ( i, j , L, 5 , 1 ) ) 
g(i,j,L,5,2)  =  signx  *  max(0 .00, dargl) 

10  continue 


The  subroutine  indxcp  utilizes  the  block  table  in  coiijunction  with  the  coordinate 
transformations  given  by  equations  39  and  40  to  compute  proper  loop  indices  for  iblkmn  and 
iblkmx.  As  an  example,  taking  ifin  =  44  and  ifin2~l  =  58,  the  computed  values  of  the  loop  indi¬ 
ces  returned  by  incLxp  are  provided  in  Table  B.l. 


Table  B.1:  Example  loop  limiter  valuea  returned  by  indxcp 


Block 
Number  (L) 

Iblkmn 

bikmx 

1 

0 

-1 

2 

0 

-1 

3 

14 

16 

4 

11 

In  this  example,  blocks  1  and  2  perform  no  computational  work  within  the  loop.  Thus, 
processors  handling  those  two  loops  can  either  be  allowed  to  continue  execution  on  other 
loops,  or  if  synchronization  is  required,  can  be  forced  to  wait  for  other  processors  to  finish  the 
loop. 
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Althou^  the  subroutine  indxcp  addresses  the  issue  of  computational  to  blodc  coordi¬ 
nate  transformations  and  eliminates  any  specialized  loops  which  are  specific  to  certain 
blodks-  as  seen  in  code  fragmoit  2-acro8s-blodi  references  are  still  not  properly  treated.  For 
this  reason,  an  additional  routine  was  added  to  all  loops  containing  across-blodc  references. 
The  use  of  this  routine  is  shown  in  the  following  code  fragment  taken  from  the  turbulent  vis¬ 
cosity  calculation  subroutine. 


Code  Fragment  4:  Aoross-Block  Reference  Check 


Serial  Code 

do  116  i=icrans, iend-1 

if  (vort (i, jbegin(i) -l,k)  .It.  .1)  then 
bstar  =  0. 
else 

dpdxi  =  0 . 5* (p (i+1 , jbegin(i) -l,k) -p ( i-1, jbegin (i ) -1, k) ) 
endi ' 

116  continue 


PANS-3EMCode 

do  116  L  =  1.  numblk 

call  indxcD (L. itrans . iend-1 . iroin. imax. kmln. kmax. iblkmn. iblkmx. 

jblkmn,  jblkmx,  kblkitm,  kblkmx) 
do  116  i  =  iblkmn.  iblkmx 
lf(  i  .eg.  iblkmn  1  then 
Lilt  =  L 
iJil  =  i 

iPl  =  i^-1 

LIP  =  L 

idir  =  -1 

call  offset!  Llm.  iml .  idir.  1  ) 
elseifl  i  .eg.  iblkmx  1  then 
Llm  =  L 
iml  =  i-1 

lib  =  L 

idir  =  1 

call  g££setl  LlB.  ipl.  idir.  l  l 

else 

Llm  =  L 
iml  =  i-1 
LIP  =  L 
iPl  =  i-t-1 
endif 

if(  vort (i, jbegin{L, i) -l,k,L)  .It.  .1  )  then 
bstar  =  0. 
else 

dpdxi=0 . 5* (p ( jpl. j begin { L, i) -l.k.Llol -p(iml. jbegin(L, i) -l.k.Llm) ) 
endif 

116  continue 


Subroutine  offset  takes  as  inputs  the  i  coordinate  in  block  coordinate  space  as  the  first 
two  parameters  of  the  subroutine  along  with  the  direction  of  the  offset  (either  -1  or  -t-l  for  a 
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decrement  or  increment,  respectively)  in  the  third  argiunent  and  finally  the  number  of  points 
to  offset  firom  the  current  block  coordinate  in  the  fourth  parameter,  and  the  new  block  coordi¬ 
nates  are  returned  in  the  first  two  subroutine  arguments.  With  this  final  addition  to  all  loops 
containing  across-block  rrferences,  the  block-decomposition  process  is  complete.  Advantages 
to  this  method  include: 

•  The  inmlementation  is  extremely  flexible.  Given  the  block  table  and  proper  coordi¬ 
nate  transformation  equations,  any  looping  index  can  be  properly  treated. 

*  Modifications  to  the  looping  calculations  are  localized  to  a  single  subroutine 
rather  than  scattered  through  the  code. 

*  A  blocking  implementation  in  the  other  two  coordinate  directions  can  be  coded 
into  the  subroutine  to  allow  for  a  completely  generic  blodc  decomposition. 

•  No  memory  is  wasted  in  buffer  arrays. 

The  code  listings  for  subroutines  indxcp  and  o/fset  appear  in  Appendix  D.  Examina¬ 
tion  of  the  code  reveals  that  both  subroutines  are  of  0(1)  and  thus  they  do  not  affect  the  com¬ 
plexity  of  the  computer  code.  The  overhead  associated  with  the  subroutine  call  can  be  reduced 
by  inlining  the  subroutines.  There  are,  however,  performance  penalties  associated  with  the 
domain  decomposition  which  are  discussed  in  Chapter  4. 

B.5  Vectorization  issues 

Although  the  serial  code  was  written  to  be  highly  vectorizable,  this  was  not  a  priority 
in  the  blocked  version  of  the  code  since  near-term  target  machines  for  implementation^  were 
all  RISC-based  scalar  machines.  Consequently,  the  reduction  of  code  vectorizability  caused  by 
the  addition  of  the  conditional  such  as  that  appearing  in  code  fragment  4  was  deemed  an 
acceptable  trade-off  for  the  lower  memory  required  when  compared  to  the  buffer-array  con¬ 
cept.  Code  fragment  5  shows  a  typical  looping  construct  similar  to  that  given  in  code  fragment 
5  along  with  a  vectorization  sxunmary  from  the  Convex  C220  Fortran  compiler.  The  loop  body 
is  omitted  in  this  example  since  it  is  not  required  to  demonstrate  the  loop  vectorizability. 

1.  Current  implementation  plans  are  for  a  shared  monory  DEC  alpha-based  system,  and  a  distrib¬ 
uted  memory  workstation  farm  concept.  Longer-term  plans  call  for  inq)lementation  cm  the  Intel 
i860™  XP-based  Paragon,  a  vector  machine. 
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Cndg  Fragiwnt  S:  VprtnriMhility  Issues 


Serial  Code 

do  8881  k  =  2,  kmax-2 
do  100  j  =  1,  jmax-1 
do  100  i  =  2,  ifin-1 

/•loop  body  with  across-block  references  appears  here 

100  continue 
8881  continue 


Tile  vectorizatdon  summary  for  this  loop  is  given  as 


Iterative  Variable 

K 

J 

I 


Reordering  Tranuformatiftn 

Parallel 

Scalar 

FULL  VECTOR 


PANS-SEMCode 

do  8881  k  =  2,  kinax-2 
do  100  L  =  1,  numblk 

call  indxcp(L, 2, ifin-1 , jmin, jroax,kmin, kmax, iblkmn, iblkmx, jblkmn, 
jblkmx,  kblkinn, kblkinx) 
do  100  i  =  iblkmn,  iblkmx 
if (  i  .eg.  iblkmn  )  then 
Llm  =  L 
iml  =  i 
idir  =  -1 

call  offset (  LI,  iml,  idir,  1  ) 
else 

iml  =  i-1 
Llm  =  L 
endif 

do  100  j  =  1,  jmax-1 

/•loop  body  with  across-block  references  appears  here 

100  continue 
8881  continue 


Ihe  vectorization  summary  for  this  loop  is 


Iterative  Variable 

L 

I 

J 

K 


Reordering  Transformation 


Scalar 

Scalar 


FULL  VECTOR  Interchanged 
Parallel 


The  conditional  destroys  the  vectorizability  over  the  i  index.  On  the  Convex  C220,  the 
compiler  instead  vectorized  over  the^'  index.  Hiis  vectorization  is  not  as  efficient  due  to  the 
fact  that  vectorization  over  the  second  array  index  results  in  a  non-stride-one  or  non-contigu- 
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0U8  meoiory  access  whidi  can  be  considerably  slower  than  a  stride-one  memory  access  [81; 
however,  vectorization  and  non-stride-one  memory  access  issues  are  highly  compiler  and 
machine-ardiitecture  specific  [23,37],  and  the  issue  must  be  cardully  examined  on  any  vector 
machine  of  implementation.  In  this  case,  it  may  be  possible  to  force  vectorization  over  the  i 
index  with  the  use  of  compiler  directives. 

B.6  Synchronization  Issues 

Reference  [35]  notes  that  coarse-grained  parallelism  generally  adiieves  better  perfor¬ 
mance  gains  than  a  fine-grained  parallel  approach.  Coarse-grained  parallelism  can  be  viewed 
as  program-level  or  at  least  subroutine-level  parallelism  as  opposed  to  fine-grained  parallel¬ 
ism  whidi  achieves  concurrency  at  the  loop  level.  The  code  as  implemented  in  this  project 
uses  a  fine-grained  parallel  approach.  Implementation  of  coarse-grained  parallelism  on  the 
serial  version  of  the  code  requires  a  great  deal  more  code  restructuring  as  well  as  certain  soft¬ 
ware  tools  capable  of  explicit  control  over  the  parallel  processes.  Figures  B.4  and  B.5  depict 
the  differences  between  coarse-  and  fine-grained  parallelism.  In  the  coarse-grained  approach, 
the  initialization  of  a  process,  known  as  thread  forking,  occurs  very  infrequently.  On  the  other 
hand,  thread  foiking  occurs  much  more  often  in  a  fine-grained  approach.  Since  each  forking 
process  requires  approximately  400  clock  cycles  to  execute  [34],  the  fine-grained  approach  can 
require  significantly  greater  execution  times  for  CFD  codes,  which  are  very  heavily  loop  struc¬ 
tured. 

While  the  fine-grained  parallel  approach  does  suffer  from  performance  problems,  syn¬ 
chronization  issues  are  not  as  great  a  concern  because  each  loop  is  completed  by  all  processors 
before  execution  of  the  next  loop  begins.  This  is  in  contrast  to  the  coarse-grained  parallel 
approach  for  which  it  cannot  be  known  in  which  order  the  parallel  loops  will  execute.  Code 
fragment  6  illustrates  the  potential  synchronization  problems. 
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FiguraB.S:  Fiii*>grain«d  panriMism 


2004  continue 


/’portion  of  code  omitted  here 


do  2007  )c=l,kmax-l 
do  2007  j=l,jmeuc-l 
do  2007  i=l,imax-l 

/’portion  of  loop  omitted  here 

zkxipi  (i,  j  ,k)  =  sqrt  (xnip ( i,  j ,  k)  ”2+ynip (i,  j  , k)  ”2  + 

znip(i,  j,k)”2)/  ( .5’  (jtild(i+l,  j,k)+jtild(i,  j,k)  ) ) 

/’portion  of  loop  omitted  here 

2007  continue 


PANS-aEMrnrte 

do  2004  L  =  1,  numblk 

call  indxcp(L, 2, imax, jmin, jmax.kmin, kmax, iblkmn, iblkmx, 
j  blkmn , j  blkmx, kblkron , kblkmx ) 
do  2004  i  =  iblkmn,  iblkmx 
if (  i  .eq.  iblkmn  )  then 
Llm  =  L 
iml  =  L 
idir  =  -1 

call  offset (  Llm,  iml,  idir  1  ) 
else 

Llm  =  L 
iml  =  i-i 
endif 

do  2004  j  =  1,  jmax 
do  2004  k  =  1,  kmax 

jtildd,  j  ,k,L)  =  (l./3.*(  (xejp(i,j,k,L)+xnip(i,  j,k,L)  + 
xzkp(i,  j  ,k,L) )  Mxpbd,  j,k,L)-xpb(iml,  j-l,k-l,Llm) )  + 
(yejpli, j,k,L) +ynip{i, j,k,L) +yzkp( i, j , k, L) ) • 

(ypb(i, j ,k,L)-ypb(iml, j-l,k'l,Llm) )+ 

(zejp(i,j,k,L) +znip(i, j ,k,L) +zzkp{i,  j ,  k,  L) ) ’ 

(zpb(i, j,k,L)-zpb(iml, j-l,k-l,Llm) ) ) ) 

2004  continue 


/’portion  of  code  omitted  here 


do  2007  L  =  1,  numblk 

call  indxcp (L, 1, imax-1, jmin, jmax,kmin, kmax, iblkmn, iblkmx, 
jblkmn, jblkmx, kblkmn, kblkmx) 
do  2007  i  =  iblkmn,  iblkmx 
if(  i  .eq.  iblkmx  )  then 
Lip  =  L 
ipl  =  L 
idir  =  1 

call  offset;  Lip,  ipl,  idir  1  ) 
else 

Lip  =  L 
ipl  =  i+1 
endif 

do  2007  j  =  1,  jmax-1 
do  2007  k  =  1,  kmax-1 

/’portion  of  loop  omitted  here 
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zkxipi  (i,  j  ,k;,L)  =  sqrt  (xnip( i,  j  ,  k,  L)  **2-fynip (i ,  j  ,  k, L)  •  *2  » 
znipd.i  ■k.L)»*2)/  ( .  5*  ( 1cild(  iol .  1 ,  k.  Llpi  i  . ,  k .  D  j  ) 

/^portion  of  loop  omitted  here 

2007  continue 

In  loop  2007,  for  i  s  iblkmx,  the  computation  of  the  variable  zkidpi  uses  an  across 
block  reference  of  the  variable  jtild  where  the  value  for  jtild  is  computed  in  loop  2004.  If  code 
for  each  blodE  is  executing  on  separate  processors,  then  it  cannot  be  assumed  that  the 
required  array  element  at  Jtild  has  been  calculated  when  the  across-block  referoice  is  made. 
Failure  to  synchronize  the  code  before  execution  of  loop  2007  will  result  in  errors  which  are 
nondeterministic  in  nature  and  consequently  extremely  difficult  to  find. 

The  situation  is  resolved  by  utilizing  an  operating-system-level  fimction  call  or  com¬ 
piler  directive  to  ensure  that  all  threads  of  execution  reach  a  rendezvous  point  before  the 
potentially  incorrect  reference  is  made. 

fi.7  Future  Code  Modification  Issues 

As  discussed  in  Chapter  3,  the  objective  at  the  development  of  PANS-3EM  was  to  pro¬ 
vide  a  code  readily  adaptable  for  implementation  on  either  a  shared-  or  distributed-memory 
computing  platform.  Each  of  these  platforms  require  a  different  implementation  style. 

B.7.1  Shared-Memory  Implementation 

PANS-3EM  has  been  successfully  run  on  several  shared-memory  computing  plat¬ 
forms.  Parallelism  can  be  achieved  at  the  loop  level  in  one  of  three  methods.  The  first  involves 
the  simple  use  of  a  command  line  compiler  flag  which  leaves  the  parallelization  task  exclu¬ 
sively  to  the  compiler.  Because  modem  compilers  are  relatively  sophisticated  at  loop-level 
optimizations,  this  method  can  produce  substantial  performance  improvements  with  little  or 
no  work  on  the  part  of  the  programmer. 

The  second  method  of  achieving  paralleUsm  is  to  embed  compiler  directives  within  the 
source  code.  This  method  has  the  advantage  of  potentially  greater  performance  gains  than  the 
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first  method,  but  it  requires  more  programming  effort. 

Finally  the  third  method  of  achieving  parallelism  requires  the  existence  of  operating 
system  level  functions  which  provide  the  programmer  explicit  control  over  nearly  all  aspects 
of  program  parallelization.  Unlike  the  first  two  methods,  coarse-grained  parallelism  can  be 
achieved.  Therefore,  this  method  provides  the  greatest  potential  benefit  in  terms  of  perfor¬ 
mance  improvement.  In  order  to  utilize  this  method,  PANS-3EM  should  be  restructured  to 
allow  for  a  greater  degree  of  domain  decoupling  than  is  possible  in  the  currant  implementa¬ 
tion.  This  requires  the  examination  of  the  code  to  determine  all  data  dependencies  and  syn¬ 
chronization  points.  The  reader  is  referred  to  References  [35]  and  [34]  for  excellent 
discussions  on  this  type  of  shared-memory  parallel  programming. 

B.7.2  Distributed-Memory  Implementation 

Since  a  distributed-memory  machine  generally  has  no  block  of  common  memory  to 
contain  variables  used  by  all  subdomains,  implementation  of  PANS-3EM  on  such  a  platform 
requires  the  complete  decoupling  of  the  subdomains.  This  can  be  accomplished  by  modifying 
the  use  of  the  fourth  index  of  the  data  arrays  (the  L  index)  to  identify  a  particular  processor 
on  which  a  given  subdomain  executes.  A  decoupling  of  the  domains  provides  the  greatest 
potential  performance  improvement  Using  this  approach,  it  is  possible  to  allow  the  time-inte¬ 
gration  to  proceed  at  different  rates  over  each  of  the  subdomains  depending  on  the  nature  of 
the  flow  and  structure  of  the  grid.  Since  Moran  [28]  has  already  examined  the  stability  issues 
associated  with  local  time  stepping  in  the  serial  code,  few  stability-related  problems  should 
arise  in  this  type  of  modification. 

Other  issues  related  to  a  distributed-memory  implementation  include  the  message 
passing  capabilities  and  memory  capacities  of  a  particular  machine.  As  the  size  of  the  problem 
increases  or  as  the  computational  mesh  is  refined,  memory  limitations  will  affect  the  number 
of  subdomains  that  must  be  used.  A  greater  number  of  subdomains  corresponds  to  a  greater 
amount  of  information  which  must  be  communicated  via  messages.  If  this  information  is  not 
communicated  effectively,  then  parallel  efficiency  drops  dramatically  and  the  scalability  of  the 
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computer  algorithm  is  adversely  affected. 

Ihe  message  passing  c^abilities  of  the  target  machine  will  also  affect  the  dioice  of 
data  structures  used  to  hold  the  messages.  For  example,  it  is  possible  to  pass  individual  array 
elements  as  messages  for  across-block  references.  However,  it  may  be  more  feasible  to  pass  an 
entire  array  section  than  an  individusd  array  elmnent  for  across-blo(^  references.  If  so,  the 
buffer-array  concept  should  be  carefully  examined  since  the  performance  gains  may  outwei^ 
the  incurred  memory  penalty. 
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Appendix  C:  Code  Modification 


Approximately  21000  lines  of  code  comprise  the  serial  version  of  the  computer  pro¬ 
gram.  The  extra  control  and  looping  structures  required  for  the  subdomain  implementation 
resulted  in  7000  additional  lines  of  code.  In  addition  to  loop  modifications,  certain  structural 
aspects  of  the  serial  code  were  changed  slightly  durii^  the  modification  process.  A  synopsis  of 
those  changes  appears  in  this  appendix  in  the  form  of  high-level  program  flow  charts.  The 
code  flow  diagrams  for  the  serial  and  parallel  code  versions  appear  in  Figures  B.l  and  B.2 
respectively. 
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C.1  serial  coda  flow  diagram  (1  of  3) 


C.1  S«rMcod«fiowdiagnun(2of  3) 


C.1  S«rtal  cod*  flow  diagnun  (3  of  3) 
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Rgura  PANS-3EM  flow  diagram  (1  of  3) 
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nguraC^  PANS-3EMflowdiagrain(2of3) 
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ngura  PANS^M  flow  diagram  (3  of  3) 


90 


Appendix  D:  Code  Listings 


This  appendix  contains  listings  of  the  routines  whidi  are  crucial  to  the  domain-decom¬ 
position  approach  taken  in  this  investigation.  Listings  include 

•  the  code  Usting  for  the  domain  decomposition  subroutine, 

•  the  code  listing  for  the  loop  index  computation  subroutine, 

•  the  code  listing  for  the  offset  calculation  s\ibroutine, 

•  the  code  listing  for  the  computational-to-block  coordinate  transformation  subrou¬ 
tine. 

Several  features  of  the  subroutines  were  not  fully  utilized  in  the  current  implementa¬ 
tion  of  PANS-3EM  but  are  provided  for  fixture  expandability.  A  brief  discussion  of  each  follows. 


D.l  Domain  Decomposition  Subroutine  (subroutine  domain) 

This  subroutine  divides  the  computational  domain  into  the  subdomains  and  builds 
the  block  table  described  in  Chapter  3  and  Appendix  B.  The  current  implementation  divides 
the  computational  domain  as  evenly  as  possible  over  the  i  coordinate  direction.  Given  m  nodes 
and  n  subdomains  in  the  i-coordinate  direction,  the  first  n-1  subdomains  will  contain  k  points 
where 


‘-["1 

and  the  last  subdomain  will  contain  m  -  (n  - 1)  it  points  in  the  i  direction.  With  the  lack  of 
dynamic  memory  allocation  capabilities  in  the  current  language  of  implementation,  this 
approach  can  result  in  a  maximum  of  n-1  wasted  memory  locations  since  all  subdomains  must 
be  dimensioned  the  same.  This  problem  can  be  easily  solved  if  the  language  of  implementa¬ 
tion  is  changed,  or  the  code  is  modified  for  use  on  a  distributed-memory  system. 
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Coda  Lifting  ftw  SubroudnA  Hnniain 


c*  This  subroutine  sets  up  the  subdomain  block  table  which  determines 

c*  which  blocks  must  exchange  data  across  a  face.  Current 

c*  implementation  restricts  the  blocks  to  be  contiguous,  with  no 

c*  jogs  or  shifts  of  the  domain  boundaries  across  the  block  boundaries. 

c*  See  the  documentation  for  a  description  of  the  allowable  block 

c*  configurations. 

c» 

c*  Block  table  data  structure:  The  block  table  is  implemented  as  a 
c*  two  dimensional  array  with  the  first  index  referencing  the 
c*  particular  subdomain  block.  The  second  index  ranges  from  1  to  9 
c*  with  each  index  representing  the  following: 
c* 

c*  l--the  neighbor  tc  the  block  on  its  IMIN  side 

c*  2 --the  neighbor  to  the  block  on  its  IMAX  side 

c’  3 --the  neighbor  to  the  block  on  its  JMIN  side 

c*  4--the  neighbor  to  the  block  on  its  JMAX  side 

c*  5--the  neighbor  to  the  block  on  its  KMIN  side 

c’  6--the  neighbor  to  the  block  on  its  KMAX  side 

c*  7 --the  number  of  nodes  within  the  block  in  the  i  direction 

c*  8--the  number  of  nodes  within  the  block  in  the  j  direction 

c’  9--the  number  of  nodes  within  the  block  in  the  k  direction 

c* 

c’  Fields  1-6  will  contain  a  block  number  if  the  block  has  a 

c*  neighbor  on  chat  side.  If  the  block  has  a  boundary  on  that 

c*  side,  the  field  will  contain  a  zero, 
c* 

c*  Fields  7-9  are  required  in  case  the  blocks  do  not  all  contain 

c*  Che  same  number  of  nodes  in  a  given  coordinate  direction, 

c* 

c*  Last  Modification  Date:  29  Aug  93 

c» 

c*  Comments: 
c* 


subroutine  domain)  ) 

INCLUDE  ' commons ' 

integer  minj  (numblk) ,  max  j  ( r.umblk ) , 

St  mink  (numblk) ,  maxk(numblk) 

c  character*!  answer 

c  characcer*15  domfile 

imin  =  1-ighost 
jmin  =  0 
kmin  =  0 
c 

c  Get  input  from  the  user  to  determine  subdomain  decomposition 
c  preferences 
c 

write)*,*)  'Is  there  a  domain  decomposition  file  (y/n)?' 
read ( * , * )  answer 
if)  answer  .eq.  'n'  )  then 
write)*,*)  'Do  you  want  the  program  do  set  up'. 

Sc  '  subdomains  (y/n)?' 

read ( * , * )  answer 
if)  answer  .eq.  'n'  )  then 
write)*,*)  'No  explicit  parallelization  will  occur' 
numblk  =  1 
else 

write)*,*)  'Input  the  number  of  sulxiomains  (max  =  12) 
read ( * , * )  numblk 

write)*,*)  'Current  decomposition  technique  divides'. 
Sc  '  subdomains  evenly  along  the  i  axis' 
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write!*,*)  'Enter  the  number  of  grid  points  in  the', 
i  '  i  direction' 

read ( * , * )  imaximum 
iblknd  =  idim/numblk 

Determine  the  maximum  and  minimum  coordinate  values  that  will 
c  lie  in  each  block.  These  values  are  the  actual  coordinate  values 
c  including  ghost  points  and  are  used  to  determine  block 
c  adjacencies  in  the  next  section  of  the  code, 
c 

mini!!)  =  1-ighost 
maxi(l)  =  mini (1) +idim-l 
minj (1)  =0 
maxjU)  =  jmax 
mink(l)  =  0 
maxk(l)  =  km, ax 
do  5  i  =  2 ,  numblk-1 
mini(i)  =  maxi(i-l)+l 
maxi(i)  =  mini ( i ) +idim-l 
minj (i)  =0 
maxj(i)  =  jmax 
mink(i)  =  0 
maxk ( i )  =  kmax 
5  continue 

mini(numblk)  =  maxi (numblk-l) +i 
maxi(numblk)  =  imax 
minj (numblk)  =  0 
maxj (numblk)  =  jmax 
mink (numblk)  =  0 
maxk (numblk)  =  kmax 
endif 

else 

write!*,*)  'Input  the  name  of  the  domain  decomposition  file' 
read!*,*)  domfile 

open!  UNIT=12,  FILE=domf ile,  STATUS= ' OLD ' ) 
read!  12,*)  numblk 
do  10  i  =  1,  numblk 

read ( 12 , * )  mini ( i ) , maxi ( i ) , minj ( i ) , maxj ! i ) , mink ( i ) , maxk ( i ) 

10  continue 

endif 

write!*,*)  ‘The  program  will  divide  Che  computational  domain', 

&  '  into',  numblk,  '  subdomains* 
c 

c  Now  build  the  block  Cable.  First  determine  the  neighbors  of  each 
c  of  Che  blocks, 

c 

c  **NOTE:**  Current  implementation  of  domain  decomposition  is  trivial, 
c  This  code  is  really  unnecessary,  but  will  be  required  when  a  more 
c  general  decomposition  structure  is  implemented, 

c 

do  100  i  =  1,  numblk 

if!  mini(i)  .eq.  imin  )  then 
iblktb!i,l)  =  0 
else 

do  20  j  =  1,  numblk 

if!  maxi!j)  -eq.  mini(i)-l  .and.  maxj(j)  .eq.  maxj(i) 

&  .and.  maxk{j)  .eq.  maxk(i)  )  then 

iblktb(i,l)  =  j 
goto  25 
endif 

20  continue 

25  endif 

if!  maxi(i)  .eq.  imax  )  then 
iblkcb(i,2)  =  0 
else 

do  30  j  =  1,  numblk 

if!  mini(j)  .eq.  maxi(i)+l  .and.  maxj(j)  .eq.  maxj(i) 
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Si  .and.  maxkij)  .eq.  raaxkii;  )  Chen 

iblkcb!i,2)  =  j 
goto  35 
endif 

30  continue 

35  endif 

if{  minj(i)  .eq.  jmin  )  then 
iblkcb(i,3)  =  0 
else 

do  40  j  =1,  numblk 

if(  maxj(j)  .eq.  minjiil-l  .and.  maxi(i)  .eq.  maxi(j) 
Si  .and.  maxk(j)  .eq.  maxk(i)  )  Chen 

iblkcb(i,3)  =  j 
goto  45 
endif 

40  continue 

45  endif 

if(  maxj ( ./  .eq.  jmax  )  Chen 
iblktb (  . , 4 )  =0 
else 

do  50  j  =  1,  numblk 

if(  minj(])  .eq.  maxj(i)+l  .and.  maxi(i)  .eq.  maxi ( j ) 
&  .and.  maxk(j)  .eq.  maxk(i)  )  then 

iblkcb(i,4)  =  j 
goto  55 
endif 

50  continue 

55  endif 

if (  mink(i)  .eq.  kmin  )  then 
iblktb (i, 5)  =  0 
else 

do  60  j  =  1,  numblk 

if(  maxk(j)  .eq.  mink(i)-l  .and.  maxi(i)  .eq-.  maxi(j) 
&  .and.  maxj(j)  .eq.  maxj(i)  )  then 

iblktb (i, 5)  =  j 
goto  65 
endif 

60  continue 

65  endif 

if(  maxk(i)  .eq.  kmax  )  then 
iblktb (i, 6)  =  0 
else 

do  70  j  =1,  numblk 

if(  mdnk(j)  .eq.  maxk(i)+l  .and.  maxi(i)  .eq.  maxi{j) 
&  .and.  maxj(j)  .eq.  maxj(i)  )  then 

iblktb (i, 6)  =  j 
goto  75 
endif 

70  continue 

75  endif 

100  continue 

c 

c  Now,  load  the  maximum  i  index  for  each  block.  Note  that  this 
c  assumes  that  the  minimum  i  index  for  each  block  is  1  and 

c  that  the  minimum  j  and  k  indices  are  0.  This  apparent 

c  discrepancy  is  to  minimize  conversion  pain  at  this  time, 
c 

do  110  i  =  1,  numblk 

iblktb (i, 7)  =  maxi (i) -mini (i) +1 
iblktb (i, 8)  =  maxj (i) 
iblktb (i, 9)  =  maxk(i) 

110  continue 


RETURN 

END 


D.2  Loop  Index  Computation  Subroutine  (subroutine  indxcp) 


This  subroutine  uses  the  block  table  constructed  in  subroutine  domain  to  determine 
the  proper  loop  values  for  the  i  coordinate  direction  in  block  coordinates  given  the  loop  indices 
in  computational  coordinates.  While  no  decomposition  was  performed  in  either  the  j  or  k  coor¬ 
dinate  directions,  this  subroutine  can  easily  be  modified  to  allow  for  such.  The  subroutine 
parameters  necessary  to  provide  for  this  modification  are  already  incorporated  into  the  sub¬ 
routine  argument  list. 


f!ndft  bating  for  Siihrnutine  Dnirmin 


C’ 

c*  This  subroutine  calculates  the  proper  indexes  for  all  program 
c*  loops  given  the  minimum  and  maximum  coordinates  in  computational 
c*  coordinates.  Values  are  returned  in  the  ()blhmn  and  (jblkmx 
c’  subroutine  parameters, 
c* 

QTr*W*ir*iriif»******»**1r**W****»»*****«******»**»*«*»*»**Tr*»********»»**»» 


subroutine  indxcp ( iblkniti/  ilft,  ire,  jlfc ,  jrt , klf t .  kre, 

&  iblkmn,  iblkmx,  jblkmn,  jblkmx,  kblkmn,kblkxnx) 

integer  iblknm,  ilft,  irt,  jlft,  jrt,  klft,  krt, 

Sc  iblkmn,  iblkmx,  jblkmn,  jblkmx,  kblkmn,  kblkmx 

INCLUDE  ' commons ‘ 


c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 

c 


There  are  six  cases  to  test  for  for  each  coordinate.  They  are: 


CASE  1:  The  min  and  max  coordinates  are  both  less  than  the 
minimum  coordinate  for  that  block.  Then  that  block 
does  not  participate  in  the  loop. 


CASE  2:  The  min  and  max  coordinates  are  both  greater  than  the 
maximum  coordinate  for  that  block.  Then  that  block 
does  not  participate  in  the  loop. 


CASE  3:  The  min  and  max  coordinates  are  both  within  the  block. 
Both  coordinates  are  computed. 


CASE  4:  The  min  coordinate  is  within  the  block  and  the  max 

coordinate  is  outside  the  block.  The  min  coordinate 
is  computed  and  the  max  coordinate  is  set  to  the  max 
block  coordinate. 


CASE  5:  The  min  coordinate  is  less  than  the  min  coordinate  of 
the  block  and  the  max  coordinate  is  greater  than  the 
max  coordinate  of  the  block.  Then  the  full  block 
participates  in  the  loop. 

CASE  6:  The  min  coordinate  is  less  than  the  min  coordinate  of 
the  block  and  the  max  coordinate  is  within  the  block. 
Then  the  min  coordinate  is  set  to  the  min  block 
coordinate  and  the  max  coordinate  is  calculated. 
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C  NOTE  THAT  ALL  THESE  CASES  ASSUME  A  FORWARD  SWEE?  ..TTK  THE  MA:< 
c  COORDINATE  OF  GREATER  VALUE  THAN  THE  MINIM’.'M  "jORDINATE 


c 

c  Case  1 
c 

if(  ire  .Ic.  mini  ( iblknin)  )  then 
iblkmn  =  0 
iblkmx  =  -I 
goto  1000 
endi  f 

c 

c  Case  2 
c 

if(  ilft  .gc.  maxi ( iblknm)  )  then 
iblkmn  =  0 
iblkmx  =  -1 
goto  1000 
endif 

c 

c  Case  3 
c 

if(  lift  .ge.  mini (iblknm)  .and.  irt  .le.  maxi(iblknm)  ) 
iblkmn  =  ilft-mini ( iblknm) +1 
iblkmx  =  irt-mini ( iblknm) +1 
goto  1000 
endif 

c 

Case  4 

if(  ilft  .ge.  mini (iblknm)  .and.  irt  .gt.  maxi (iblknm)  ) 
iblkmn  =  ilft-mini (iblknm) tl 
iblkmx  =  iblktb( iblknm, 7) 
gotolOOO 
endif 

Case  5 

if(  ilft  .It.  mini(iblknm)  .and.  irt  .gt.  maxi(iblknm)  ) 
iblkmn  =  1 

iblkmx  =  iblktb( iblknm, 7) 
goto  1000 
endif 


Case  6 
c 

if(  ilft  .It.  mini(iblknm)  .and.  irt  .le.  maxi(iblknm)  ) 
iblkmn  =  1 

iblkmx  =  irt -mini { iblknm) +1 
goto  1000 
endif 

1000  continue 
RETURN 
END 


then 


then 


then 


then 
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D.3  Offset  Calculation  Subroutine  (subroutine  offset) 

Tliis  subroutine  is  used  to  calculate  a  new  block  coordinate  given  from  a  block  coordi¬ 
nate  and  an  offset  value.  A  discussion  of  the  requirement  for  this  subroutine  appears  in  Sec¬ 
tion  B.2.1.  The  subroutine  uses  the  coordinate  transformation  relations  given  by  equations 
39-40.  Should  the  method  of  domain  decomposition  change,  then  those  equations  and  this 
subroutine  must  necessarily  be  modified. 


Code  Listing  for  Subroutine  Qffaftt 


^******»»**ir»*<rif**********»wiri»w»**********inr*w*irw*w**»**»********w**w* 

C* 

c*  This  subroutine  calculates  a  new  block  coordinate  given  a  block 
c*  coordinate,  (L,i),  and  an  offset  direction  and  number  of  offset 
c*  points, 
c* 

^**»*<r*********»*»**»w********ilrir***»»**»iHr***ilf*1Hf»*****»*»***»**»**'*** 


SUBROUTINE  offset!  iblknm,  i,  idir,  inumpt  ) 
'.nteger  iblknm,  i,  idir,  inumpt 
INCLUDE  ‘params' 


c  Convert  the  block  and  coordinate  to  computational  coordinates, 
c 

iglob  =  (iblknm-1) *idim  +  i  -  ighost 
c 

c  Calculate  the  new  global  coordinate, 
c 

iglob  =  iglob  +  idir*inumpt 
c 

c  Convert  the  new  coordinate  back  into  block  coordinates, 
c 

iblknm  =  INT( (iglob+ighost-.OOl) /idim) +1 
i  =  iglob+ ighost  -  ( iblknm-1) * idim 

RETURN 

END 


DA  Coordinate  Transformation  Calculation  Subroutine  (subroutine  gl2blk) 

This  subroutine  performs  the  transformation  calculation  for  converting  computational 
coordinates  into  block  coordinates.  Like  subroutine  offset,  it  uses  the  transformation  equa¬ 
tions. 
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Code  Liating  for  Suhroutina  glb2hlk 


^»'****»»*«»*«*»»«»»**'#**tt'*lr**irir*»**»*»lr»»ir»***«*»»»»****»»»»*»*-**»ii 

C* 

c*  This  subroutine  converts  a  computational  i  coordinate  into  the 
*  corresponding  block  coordinate. 


SUBROUTINE  glb2blk(  iblknm,  i  ) 

integer  iblknm,  i 

INCLUDE  'paraims' 

temp  =  i  +  ighost- .001 

iblknm  =  INT (( i+ighost- . 001) /idim) +1 
i  =  i+ighost  -  ( iblknm-1) »idim 

RETURN 

END 
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